Draft for basic NUMA observability
As I promised Andres on the Discord hacking server some time
ago, I'm attaching a very brief (and potentially way too rushed)
draft of the first step into NUMA observability in PostgreSQL,
based on his presentation [0]. It might be rough, but it is meant to get us
started. The patches have barely been tested; they are
more input for discussion than solid code, to shake out
what the proper form of this should be.
Right now it gives:
postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 16127
            6 |   256
            1 |     1
Changes since the version posted on Discord:
1. libnuma is used to centralize the dependency in the build process (to be
future proof; it also gives us the opportunity to use e.g.
numa_set_localalloc()). BTW: why is a specific autoconf version (2.69)
required?
2. the per-page get_mempolicy(2) syscall was replaced by a single call to
move_pages(2), by Bertrand
3. enhancement to support huge pages (with the above) and code to
reduce the number of pages to inquire about by doing DB block <-> OS memory
page mapping. This is a bit hard for me and I'm pretty sure it could be
done somewhat better.
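The DB block <-> OS page mapping from item 3 comes down to integer arithmetic on byte offsets. Below is a minimal sketch, assuming the buffer pool is contiguous and starts on an OS page boundary; the function names are illustrative and do not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the first OS page backing buffer number buf_id. */
static uint64_t
first_os_page(uint64_t buf_id, size_t blcksz, size_t os_page_size)
{
    return (buf_id * blcksz) / os_page_size;
}

/* Index of the last OS page backing buffer number buf_id. */
static uint64_t
last_os_page(uint64_t buf_id, size_t blcksz, size_t os_page_size)
{
    return ((buf_id + 1) * blcksz - 1) / os_page_size;
}

/* How many OS pages one buffer touches (1 when pages are larger than blocks). */
static uint64_t
os_pages_spanned(uint64_t buf_id, size_t blcksz, size_t os_page_size)
{
    return last_os_page(buf_id, blcksz, os_page_size)
        - first_os_page(buf_id, blcksz, os_page_size) + 1;
}
```

With 8kB blocks on 4kB pages every buffer spans two pages; with 2MB huge pages 256 consecutive buffers share one page, which is why only the first address per page needs to be queried.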
Some other points:
a. plenty of FIXMEs inside, and I bet I could have screwed up the void *ptr
calculations, but we somehow need to support scenarios like BLCKSZ=2kB
.. 32kB with page sizes of 4kB, 2MB, 16MB
b. I don't think it makes sense to expose users to bitmaps or int[]
arrays, so there's no support showing that potentially 1 DB block
spans 2 OS memory pages (I think it should be rare!)
c. we probably should switch to numa_move_pages(3) from libnuma, right?
d. earlier Andres wrote:
IME using pg_buffercache_pages() is often too expensive due to the per-row overhead. I think we'd probably want a number-of-pages-per-numa-node function
that does the grouping in C. Compare how fast pg_buffercache_summary() is to doing the grouping in SQL when using larger shared_buffers settings.
I think it doesn't make a lot of sense to introduce *new*
pg_buffercache_numa_usage_summary() for this, if we can go straight
for pg_shmallocations_numa view instead, shouldn't we? It will give a
much better picture for everything else for free.
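For point d, the "grouping in C" that Andres describes could be as small as one pass over the status array that move_pages(2) fills in. This is only a sketch: MAX_NODES, NodeCounts and count_pages_per_node() are hypothetical names, and a real implementation would size the array from the number of configured nodes.

```c
#include <stddef.h>

#define MAX_NODES 64            /* hypothetical upper bound for this sketch */

typedef struct NodeCounts
{
    size_t      per_node[MAX_NODES];    /* pages resident on node n */
    size_t      unknown;                /* negative status or node out of range */
} NodeCounts;

/* Aggregate a per-page status array into per-node page counts. */
static NodeCounts
count_pages_per_node(const int *status, size_t count)
{
    NodeCounts  result = {0};

    for (size_t i = 0; i < count; i++)
    {
        if (status[i] >= 0 && status[i] < MAX_NODES)
            result.per_node[status[i]]++;
        else
            result.unknown++;
    }
    return result;
}
```

Returning one row per node instead of one row per buffer avoids the per-row overhead mentioned in the quote.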
Patches and co-authors are more than welcome!
-J.
[0]: https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf
Attachments:
0001-Extend-pg_buffercache-to-also-show-NUMA-zone-id-allo.patch
From 3d3c8bac2197288ef625123c11f06b1c980c79d9 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 7 Feb 2025 14:06:32 +0100
Subject: [PATCH] Extend pg_buffercache to also show NUMA zone id allocated
---
contrib/pg_buffercache/Makefile | 3 +-
contrib/pg_buffercache/meson.build | 1 +
.../pg_buffercache--1.5--1.6.sql | 15 +++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 106 +++++++++++++++++-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/storage/pg_shmem.h | 1 +
7 files changed, 124 insertions(+), 6 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..9b2e9393410 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..bb59ee08a71
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,15 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..25a8b9e2ba0 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
* contrib/pg_buffercache/pg_buffercache_pages.c
*-------------------------------------------------------------------------
*/
+#include "pg_config.h"
#include "postgres.h"
#include "access/htup_details.h"
@@ -13,10 +14,16 @@
#include "funcapi.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#include <unistd.h>
+#endif
+#include "storage/pg_shmem.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +50,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -65,6 +73,15 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+static void
+pg_buffercache_mark_numa_invalid(BufferCachePagesContext *fctx, int n)
+{
+ int i;
+ for (i = 0; i < n; i++) {
+ fctx->record[i].numa_zone_id = -1;
+ }
+}
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -78,7 +95,12 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (SRF_IS_FIRSTCALL())
{
- int i;
+ int i, blk2page, j;
+ Size os_page_size;
+ void **os_page_ptrs;
+ int *os_pages_status;
+ int os_page_count;
+ float pages_per_blk;
funcctx = SRF_FIRSTCALL_INIT();
@@ -122,10 +144,14 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM-1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
fctx->tupdesc = BlessTupleDesc(tupledesc);
/* Allocate NBuffers worth of BufferCachePagesRec records. */
@@ -140,6 +166,30 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
/* Return to original context when allocating transient memory */
MemoryContextSwitchTo(oldcontext);
+ /* This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+ * allocated in various different OS memory pages sizes, so first
+ * we need to understand the OS memory page size before
+ * calling move_pages()
+ */
+ os_page_size = sysconf(_SC_PAGESIZE);
+ if(huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ os_page_count = (NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(NOTICE, "os_page_count=%d os_page_size=%ld pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably
+ * have bug in our buffers to OS page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
/*
* Scan through all the buffers, saving the relevant fields in the
* fctx->record structure.
@@ -177,8 +227,55 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
else
fctx->record[i].isvalid = false;
+#ifdef USE_LIBNUMA
+/* FIXME: taken from bufmgr.c, maybe move to .h ? */
+#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+ blk2page = (int) i * pages_per_blk;
+ j = 0;
+ do {
+ /*
+ * Many buffers can point to the same page, but we want to
+ * query just first address
+ */
+ if(os_page_ptrs[blk2page+j] == 0) {
+ os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
+ }
+ j++;
+ } while(j < (int)pages_per_blk);
+#endif
+
UnlockBufHdr(bufHdr, buf_state);
}
+
+
+#ifdef USE_LIBNUMA
+ /* According to numa(3) it is required to initialize library even if that's no-op. */
+ /* FIXME: should we also consider GUC debug_numa to be added just in case to disable this ? */
+ if(numa_available() == -1) {
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+ elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");
+ } else {
+ /* Amortize the number of pages we need to query about */
+ /* FIXME: switch to numa_move_pages(3) instead ? */
+ if(move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
+ elog(ERROR, "failed NUMA pages inquiry status");
+ }
+ for (i = 0; i < NBuffers; i++) {
+ blk2page = (int) i * pages_per_blk;
+ /* Technically we can get errors too here and pass that to user
+ *
+ * XXX:: also we could somehow report single DB block spanning
+ * more than 2 NUMA zones, but it should be rare (?)
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
+ }
+ }
+#else
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+#endif
+
+ pfree(os_page_ptrs);
+ pfree(os_pages_status);
}
funcctx = SRF_PERCALL_SETUP();
@@ -211,6 +308,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
nulls[7] = true;
/* unused for v1.0 callers, but the array is always long enough */
nulls[8] = true;
+ nulls[9] = true;
}
else
{
@@ -231,6 +329,8 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
/* unused for v1.0 callers, but the array is always long enough */
values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
}
/* Build and return the tuple. */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ce7534d4d23..be880184042 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -561,7 +561,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
--
2.39.5
0001-Add-optional-dependency-to-libnuma-for-basic-NUMA-aw.patch
From 58d02490a94a0a4c23fd5f7fb060c46e81862aae Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 7 Feb 2025 08:07:06 +0100
Subject: [PATCH] Add optional dependency to libnuma for basic NUMA awareness
routines
---
.cirrus.tasks.yml | 4 ++++
configure.ac | 13 +++++++++++++
meson.build | 17 +++++++++++++++++
meson_options.txt | 3 +++
src/Makefile.global.in | 1 +
src/backend/Makefile | 3 +++
src/include/pg_config.h.in | 3 +++
src/makefiles/meson.build | 3 +++
8 files changed, 47 insertions(+)
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index cfe2117e02e..db3e986957a 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -325,6 +325,10 @@ task:
SANITIZER_FLAGS: -fsanitize=address
PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
+
+ # FIXME: use or not the libnuma?
+ # --with-libnuma \
+ #
# Normally, the "relation segment" code basically has no coverage in our
# tests, because we (quite reasonably) don't generate tables large
# enough in tests. We've had plenty bugs that we didn't notice due the
diff --git a/configure.ac b/configure.ac
index f56681e0d91..fbdacc9b240 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1007,6 +1007,19 @@ fi
AC_SUBST(with_uuid)
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/meson.build b/meson.build
index 1ceadb9a830..d077ff80889 100644
--- a/meson.build
+++ b/meson.build
@@ -853,6 +853,21 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+libnuma = dependency('libnuma', required: libnumaopt)
+if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt, dirs: test_lib_d)
+endif
+if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3068,6 +3083,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3720,6 +3736,7 @@ if meson.version().version_compare('>=0.57')
'gss': gssapi,
'icu': icu,
'ldap': ldap,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index d9c7ddccbc4..4cf81b6ce25 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -103,6 +103,9 @@ option('ldap', type: 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index bbe11e75bf0..9c3fb2a4713 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -190,6 +190,7 @@ with_systemd = @with_systemd@
with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5aa..bff9f077a8c 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -54,6 +54,9 @@ ifeq ($(with_systemd),yes)
LIBS += -lsystemd
endif
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
+
override LDFLAGS := $(LDFLAGS) $(LDFLAGS_EX) $(LDFLAGS_EX_BE)
##########################################################################
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798abd..88b0d5330b6 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -663,6 +663,9 @@
/* Define to 1 to build with LDAP support. (--with-ldap) */
#undef USE_LDAP
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index d49b2079a44..211cc3ca0eb 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -229,6 +231,7 @@ pgxs_deps = {
'gssapi': gssapi,
'icu': icu,
'ldap': ldap,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
--
2.39.5
Hi,
On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote:
As I have promised to Andres on the Discord hacking server some time
ago, I'm attaching the very brief (and potentially way too rushed)
draft of the first step into NUMA observability on PostgreSQL that was
based on his presentation [0]. It might be rough, but it is to get us
started. The patches were not really even basically tested, they are
more like input for discussion - rather than solid code - to shake out
what should be the proper form of this.
Right now it gives:
postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 16127
            6 |   256
            1 |     1
Thanks for the patch!
Not doing a code review but sharing some experimentation.
First, I had to:
@@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
Size os_page_size;
void **os_page_ptrs;
int *os_pages_status;
- int os_page_count;
+ uint64 os_page_count;
and
- os_page_count = (NBuffers * BLCKSZ) / os_page_size;
+ os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;
to make it work with non tiny shared_buffers.
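The reason for the cast can be shown in isolation: NBuffers and BLCKSZ are both plain ints, so their product no longer fits in an int once shared_buffers reaches 2GB with 8kB blocks. A small sketch of that arithmetic follows; the helper names are made up for illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* True when nbuffers * blcksz cannot be represented in a signed int. */
static bool
product_overflows_int(int nbuffers, int blcksz)
{
    return (int64_t) nbuffers * blcksz > INT_MAX;
}

/* OS page count computed in 64-bit arithmetic, as in the fix above. */
static uint64_t
os_page_count_64(int nbuffers, int blcksz, uint64_t os_page_size)
{
    return ((uint64_t) nbuffers * blcksz) / os_page_size;
}
```

For example, 5242880 buffers of 8kB on 4kB pages give 10485760 OS pages, which matches the NOTICE lines in the output below.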
Observations:
when using 2 sessions:
Session 1 first loads buffers (e.g., by querying a relation) and then runs
'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'
Session 2 does nothing but runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'
I see a lot of '-2' for the numa_zone_id in session 2, indicating that pages appear
as unmapped when viewed from a process that hasn't accessed them, even though
those same pages appear as allocated on a NUMA node in session 1.
To double check, I created a function pg_buffercache_pages_from_pid() that is
exactly the same as pg_buffercache_pages() (with your patch) except that it
takes a pid as input and uses it in move_pages(<pid>, …).
Let me show the results:
In session 1 (that "accessed/loaded" the ~65K buffers):
postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177310
            0 |   65192
           -2 |     378
(3 rows)
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
        1662580
In session 2:
postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      85
           -2 |   65494
(3 rows)
^
postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      90
           -2 |   65489
(3 rows)
But when session's 1 pid is used:
postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |   65195
           -2 |     384
(3 rows)
Results show:
- Correct NUMA distribution in session 1
- Correct NUMA distribution in session 2 only when using pg_buffercache_pages_from_pid()
  with the pid of session 1 as a parameter (the session that actually accessed the buffers)
Which makes me wonder if using numa_move_pages()/move_pages() is the
right approach. Would be curious to know if you observe the same behavior though.
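For readers decoding the numbers above: in query mode (nodes == NULL) move_pages(2) writes either a node number or a negative errno into each status slot, and -2 is -ENOENT, which the man page documents as "the page is not present". A small sketch of that interpretation follows; the enum and function names are hypothetical.

```c
#include <errno.h>

typedef enum PageStatusKind
{
    PAGE_ON_NODE,               /* status >= 0: the NUMA node holding the page */
    PAGE_NOT_PRESENT,           /* -ENOENT: page not present for the queried process */
    PAGE_OTHER_ERROR            /* any other negative errno, e.g. -EFAULT */
} PageStatusKind;

/*
 * Classify one entry of the status array filled in by a query-only call
 * such as move_pages(0, count, pages, NULL, status, 0).
 */
static PageStatusKind
classify_page_status(int status)
{
    if (status >= 0)
        return PAGE_ON_NODE;
    if (status == -ENOENT)
        return PAGE_NOT_PRESENT;
    return PAGE_OTHER_ERROR;
}
```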
The initial idea that you shared on discord was to use get_mempolicy() but
as Andres stated:
"
One annoying thing about get_mempolicy() is this:
If no page has yet been allocated for the specified address, get_mempolicy() will allocate a page as if the thread
had performed a read (load) access to that address, and return the ID of the node where that page was allocated.
Forcing the allocation to happen inside a monitoring function is decidedly not great.
"
The man page looks correct (verified with "perf record -e page-faults,kmem:mm_page_alloc -p <pid>")
while using get_mempolicy().
But maybe we could use get_mempolicy() only on "valid" buffers i.e
((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi Bertrand,
Thanks for playing with this!
Which makes me wonder if using numa_move_pages()/move_pages is the right approach. Would be curious to know if you observe the same behavior though.
You are correct, I'm observing identical behaviour, please see attached.
Forcing the allocation to happen inside a monitoring function is decidedly not great.
We probably would need to split it into a separate new view
within the pg_buffercache extension; that is going to be slow, yet
it would still provide valid results. In the previous approach
get_mempolicy() was allocating on first access, but it was slow not only
because it was allocating but also because it needed one syscall per
address (yikes!). I struggle to imagine how e.g. scanning
(really allocating) a 128GB buffer cache in the future won't cause issues
- that's some 16-17 million (* 2) syscalls to issue when not using
move_pages(2).
Another thing is that numa_maps(5) won't help us much either (not enough
granularity).
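The "16-17 million (* 2)" estimate can be reproduced with two divisions, assuming 8kB blocks and 4kB OS pages; the helper names are illustrative only.

```c
#include <stdint.h>

/* Number of DB blocks in a buffer pool of pool_bytes. */
static uint64_t
buffer_count(uint64_t pool_bytes, uint64_t blcksz)
{
    return pool_bytes / blcksz;
}

/* Syscalls needed when every OS page is queried with its own call. */
static uint64_t
per_page_syscall_count(uint64_t pool_bytes, uint64_t os_page_size)
{
    return pool_bytes / os_page_size;
}
```

For a 128GB pool this gives 16777216 blocks and 33554432 per-page calls, compared with a single move_pages(2) call that takes the whole array of addresses.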
But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
Different perspective: I wanted to use the same approach in the new
pg_shmemallocations_numa, but that won't cut it there. The other idea
that came to my mind is to issue move_pages() from the backend that
has already used all of those pages. That literally means one of the
below ideas:
1. from somewhere like checkpointer / bgwriter?
2. add touching memory on backend startup like always (sic!)
3. or just attempt to read/touch memory addr just before calling
move_pages(). E.g. this last option is just two lines:
if(os_page_ptrs[blk2page+j] == 0) {
+ volatile uint64 touch pg_attribute_unused();
os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
(os_page_size*j);
+ touch = *(uint64 *)os_page_ptrs[blk2page+j];
}
and it seems to work while still issuing far fewer syscalls with
move_pages() across backends, well at least here.
Frankly speaking I do not know which path to take with this, maybe
that's good enough?
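Option 3 generalizes to a loop that reads one byte from every OS page of a range, so that each page is faulted into the calling process before the inquiry. This is a standalone sketch, not the patch code: touch_pages() is a hypothetical helper, and it reads a single byte instead of a uint64 to avoid any alignment assumptions.

```c
#include <stddef.h>

/*
 * Read one byte from every OS page in [base, base + len). The volatile
 * qualifier keeps the compiler from discarding the otherwise unused reads.
 * Returns the number of pages that were touched.
 */
static size_t
touch_pages(const void *base, size_t len, size_t os_page_size)
{
    const volatile char *p = base;
    size_t      touched = 0;

    for (size_t off = 0; off < len; off += os_page_size)
    {
        (void) p[off];
        touched++;
    }
    return touched;
}
```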
-J.
Hi Jakub,
On Mon, Feb 17, 2025 at 01:02:04PM +0100, Jakub Wartak wrote:
On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi Bertrand,
Thanks for playing with this!
Which makes me wonder if using numa_move_pages()/move_pages is the right approach. Would be curious to know if you observe the same behavior though.
You are correct, I'm observing identical behaviour, please see attached.
Thanks for confirming!
We probably would need to split it to some separate and new view
within the pg_buffercache extension, but that is going to be slow, yet
still provide valid results.
Yup.
In the previous approach that
get_mempolicy() was allocating on 1st access, but it was slow not only
because it was allocating but also because it was just 1 syscall per
1x addr (yikes!). I somehow struggle to imagine how e.g. scanning
(really allocating) a 128GB buffer cache in future won't cause issues
- that's like 16-17mln (* 2) syscalls to be issued when not using
move_pages(2)
Yeah, get_mempolicy() not working on a range is not great.
But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
Different perspective: I wanted to use the same approach in the new
pg_shmemallocations_numa, but that won't cut it there. The other idea
that came to my mind is to issue move_pages() from the backend that
has already used all of those pages. That literally means one of the
below ideas:
1. from somewhere like checkpointer / bgwriter?
2. add touching memory on backend startup like always (sic!)
3. or just attempt to read/touch memory addr just before calling
move_pages(). E.g. this last option is just two lines:
if(os_page_ptrs[blk2page+j] == 0) {
+ volatile uint64 touch pg_attribute_unused();
os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
(os_page_size*j);
+ touch = *(uint64 *)os_page_ptrs[blk2page+j];
}
and it seems to work while still issuing far fewer syscalls with
move_pages() across backends, well at least here.
One of the main issues I see with 1. and 2. is that we would not get accurate
results should the kernel decide to migrate the pages. Indeed, the process doing
the move_pages() call needs to have accessed the pages more recently than any
kernel migrations to see accurate locations.
OTOH, one of the main issues that I see with 3. is that the monitoring could
probably influence the kernel's decision to start page migration (I'm not 100%
sure but I could imagine it may influence the kernel's decision due to having to
read/touch the pages).
But I'm thinking: do we really need to know the page location of every single page?
I think what we want to see is if the pages are "equally" distributed on all
the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
or contiguous chunks). We could create a new function that would accept a
"sampling distance" as parameter for example, thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi Bertrand,
TL;DR: the main problem seems to be choosing how to page-fault the
shared memory before the backend uses numa_move_pages(), as
the memory mappings (fresh after fork()/CoW) do not seem to be ready for
a numa_move_pages() inquiry.
On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
We probably would need to split it to some separate and new view
within the pg_buffercache extension, but that is going to be slow, yet
still provide valid results.
Yup.
OK so I've made that NUMA inquiry (now with that "volatile touch" to
get valid results for not used memory) into a new and separate
pg_buffercache_numa view. This avoids the problem that somebody would
automatically run into this slow path when using pg_buffercache.
But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
Different perspective: I wanted to use the same approach in the new
pg_shmemallocations_numa, but that won't cut it there. The other idea
that came to my mind is to issue move_pages() from the backend that
has already used all of those pages. That literally means one of the
below ideas:
1. from somewhere like checkpointer / bgwriter?
2. add touching memory on backend startup like always (sic!)
3. or just attempt to read/touch memory addr just before calling
move_pages(). E.g. this last option is just two lines:
if(os_page_ptrs[blk2page+j] == 0) {
+ volatile uint64 touch pg_attribute_unused();
os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
(os_page_size*j);
+ touch = *(uint64 *)os_page_ptrs[blk2page+j];
}
and it seems to work while still issuing far fewer syscalls with
move_pages() across backends, well at least here.
One of the main issues I see with 1. and 2. is that we would not get accurate
results should the kernel decide to migrate the pages. Indeed, the process doing
the move_pages() call needs to have accessed the pages more recently than any
kernel migrations to see accurate locations.
We never get a fully accurate state, as the zone memory migration might
be happening while we query it, but in theory we could add something to
e.g. checkpointer/bgwriter that would query it on demand and report
it back somehow through shared memory (?). I'm somewhat afraid of that,
though, because as stated at the end of this email it might take some time
(we probably wouldn't need to "touch memory" then after all, as all of
it is active), and that's still an impact on those bgworkers. Somehow I
feel safer if that code is NOT part of a bgworker.
OTOH, one of the main issues that I see with 3. is that the monitoring could
probably influence the kernel's decision to start page migration (I'm not 100%
sure but I could imagine it may influence the kernel's decision due to having to
read/touch the pages).
But I'm thinking: do we really need to know the page location of every single page?
I think what we want to see is if the pages are "equally" distributed on all
the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
or contiguous chunks). We could create a new function that would accept a
"sampling distance" as parameter for example, thoughts?
The way I envision it (and I think this is what Andres wanted; not sure, I have
yet to see him comment on all of this) is to give PG devs a way to
quickly spot NUMA imbalances, even for a single relation. Probably some
DBA in the wild could also query it from time to time to see how PG/kernel
distributes memory. It seems to be more of a debugging and coding aid
for future NUMA optimizations than something used constantly by
monitoring. I would even dare to say it could require
--enable-debug (or some other developer-only toggle), but apparently
there's no need to hide it like that if those are separate views.
Changes since previous version:
0. rebase due to the recent OAuth commit introducing libcurl
1. cast NBuffers to uint64, as you found out
2. put stuff into pg_buffercache_numa
3. 0003 adds pg_shmem_numa_allocations. Or should we rather call it
pg_shmem_numa_zones or maybe just pg_shm_numa?
If there would be agreement that this is the way we want to have it
(from the backend and not from checkpointer), here's what's left on
the table to be done here:
a. isn't there something quicker for touching / page-faulting memory?
If not, then maybe add CHECK_FOR_INTERRUPTS() there? BTW I've tried
an additional MAP_POPULATE in PG_MMAP_FLAGS, but that didn't help (it
probably only works for the parent/postmaster). I've also tried
MADV_POPULATE_READ (5.14+ kernels only) and that seems to work too:
+ rc = madvise(ShmemBase, ShmemSegHdr->totalsize, MADV_POPULATE_READ);
+ if(rc != 0) {
+ elog(NOTICE, "madvise() failed");
+ }
[..]
- volatile uint64 touch pg_attribute_unused();
os_page_ptrs[i] = (char *)ent->location + (i *
os_page_size);
- touch = *(uint64 *)os_page_ptrs[i];
With volatile touching of memory or MADV_POPULATE_READ the result seems
to be reliable (s_b 128MB here):
postgres@postgres:1234 : 14442 # select * from
pg_shmem_numa_allocations order by numa_size desc;
name | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
Buffer Blocks | 0 | 134221824
XLOG Ctl | 0 | 4206592
Buffer Descriptors | 0 | 1048576
transaction | 0 | 528384
Checkpointer Data | 0 | 524288
Checkpoint BufferIds | 0 | 327680
Shared Memory Stats | 0 | 311296
[..]
Without at least one of those two, a new backend reports complete garbage:
name | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
Buffer Blocks | 0 | 995328
Shared Memory Stats | 0 | 245760
shmInvalBuffer | 0 | 65536
Buffer Descriptors | 0 | 65536
Backend Status Array | 0 | 61440
serializable | 0 | 57344
[..]
b. refactor shared code so that it goes into src/port (but with
Linux-only support so far)
c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
d. fix tests, indent it, docs, make cfbot happy
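For item a., the two approaches tried above (MADV_POPULATE_READ and the volatile read) can be combined so that older kernels still work. This is only a sketch under assumptions: the helper name prefault_region() is hypothetical, base is assumed to be page-aligned, and whether populating is sufficient for the NUMA inquiry is what the results above suggest, not something this code verifies.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not in the patch): make sure every page in
 * [base, base + len) is mapped in this process.  Returns 1 if
 * MADV_POPULATE_READ did the work, 0 if we fell back to reading one
 * word per page.  base is assumed to be page-aligned.
 */
static int
prefault_region(void *base, size_t len, size_t page_size)
{
	size_t		off;

	assert(base != NULL && page_size >= sizeof(uint64_t));

#ifdef MADV_POPULATE_READ
	/* one syscall for the whole range; fails on kernels older than 5.14 */
	if (madvise(base, len, MADV_POPULATE_READ) == 0)
		return 1;
#endif

	/* fallback: a volatile read forces a read fault on each page */
	for (off = 0; off < len; off += page_size)
	{
		volatile uint64_t touch = *(volatile uint64_t *) ((char *) base + off);

		(void) touch;
	}
	return 0;
}
```

In the backend the fallback loop is where a CHECK_FOR_INTERRUPTS() would presumably go, since it is the part that can run for a long time.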
As for the sampling, dunno, fine for me. As an optional argument? But
wouldn't it be better to find a way to actually make it quick?
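If sampling were added, the "sampling distance" could simply thin out the list of page addresses before the inquiry. A minimal sketch, assuming a hypothetical helper sample_page_ptrs() that is not in the patch; the resulting array would then be passed to numa_move_pages() the same way the full array is today.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not in the patch): fill ptrs[] with the address
 * of every stride-th OS page of a region and return how many entries
 * were written.  stride == 1 means "every page".
 */
static size_t
sample_page_ptrs(char *base, size_t total_pages, size_t page_size,
				 size_t stride, void **ptrs, size_t max_ptrs)
{
	size_t		page;
	size_t		n = 0;

	assert(stride > 0 && page_size > 0);

	for (page = 0; page < total_pages && n < max_ptrs; page += stride)
		ptrs[n++] = base + page * page_size;

	return n;
}
```

With a stride of 3 over 8 pages this picks pages 0, 3 and 6. It reduces both the touching and the inquiry cost proportionally, at the price of missing imbalances smaller than the stride.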
OK, so here's a larger test, on 512GB with 8x NUMA nodes and s_b set to
128GB with numactl --interleave=all pg_ctl start:
postgres=# select * from pg_shmem_numa_allocations ;
name | numa_zone_id | numa_size
------------------------------------------------+--------------+-------------
[..]
Buffer Blocks | 0 | 17179869184
Buffer Blocks | 1 | 17179869184
Buffer Blocks | 2 | 17179869184
Buffer Blocks | 3 | 17179869184
Buffer Blocks | 4 | 17179869184
Buffer Blocks | 5 | 17179869184
Buffer Blocks | 6 | 17179869184
Buffer Blocks | 7 | 17179869184
Buffer IO Condition Variables | 0 | 33554432
Buffer IO Condition Variables | 1 | 33554432
Buffer IO Condition Variables | 2 | 33554432
[..]
but it takes 23s. Yes it takes 23s to just gather that info with
memory touch, but that's ~128GB of memory and is hardly responsive
(lack of C_F_I()). By default without numactl's interleave=all, you
get a clear picture of the lack of NUMA awareness in the PG shared segment
(just as Andres presented, but now it is evident; well it is subject to
autobalancing of course):
postgres=# select * from pg_shmem_numa_allocations ;
name | numa_zone_id | numa_size
------------------------------------------------+--------------+-------------
[..]
commit_timestamp | 0 | 2097152
commit_timestamp | 1 | 6291456
commit_timestamp | 2 | 0
commit_timestamp | 3 | 0
commit_timestamp | 4 | 0
[..]
transaction | 0 | 14680064
transaction | 1 | 0
transaction | 2 | 0
transaction | 3 | 0
transaction | 4 | 2097152
[..]
Somehow without interleave it is very quick too.
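The numa_size column in the listings above is a tally of the per-page status values multiplied by the OS page size. A minimal sketch of that step follows, assuming a hypothetical helper tally_node_bytes() that is not in the patch. Unlike the fixed MAX_ZONES array in 0003, it bounds-checks the node id and counts negative (errno) statuses separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): turn the status array filled
 * in by a move_pages()-style query into bytes per NUMA node.  Negative
 * entries are errno values (e.g. -ENOENT for a page not yet faulted in)
 * and entries >= max_nodes would overrun the array; both are counted as
 * skipped rather than tallied.
 */
static size_t
tally_node_bytes(const int *status, size_t npages, size_t page_size,
				 uint64_t *node_bytes, int max_nodes)
{
	size_t		i;
	size_t		skipped = 0;

	assert(max_nodes > 0);
	memset(node_bytes, 0, sizeof(uint64_t) * (size_t) max_nodes);

	for (i = 0; i < npages; i++)
	{
		if (status[i] >= 0 && status[i] < max_nodes)
			node_bytes[status[i]] += page_size;
		else
			skipped++;
	}
	return skipped;
}
```

A non-zero return value would be a hint that some pages were not faulted in by this backend, which is the "complete garbage" case shown earlier.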
-J.
Attachments:
v3-0001-Add-optional-dependency-to-libnuma-for-basic-NUMA.patch (application/octet-stream)
From d34b7b5d082c3ea3f8806a5202a3f0fa1e9cca7c Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v3 1/3] Add optional dependency to libnuma for basic NUMA
awareness routines
---
.cirrus.tasks.yml | 4 ++++
configure.ac | 13 +++++++++++++
meson.build | 17 +++++++++++++++++
meson_options.txt | 3 +++
src/Makefile.global.in | 1 +
src/backend/Makefile | 3 +++
src/include/pg_config.h.in | 3 +++
src/makefiles/meson.build | 3 +++
8 files changed, 47 insertions(+)
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2e..e3b7554d9e8 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -436,6 +436,10 @@ task:
SANITIZER_FLAGS: -fsanitize=address
PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
+
+ # FIXME: use or not the libnuma?
+ # --with-libnuma \
+ #
# Normally, the "relation segment" code basically has no coverage in our
# tests, because we (quite reasonably) don't generate tables large
# enough in tests. We've had plenty bugs that we didn't notice due the
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/meson.build b/meson.build
index 574f992ed49..cf9dead5d02 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,21 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+libnuma = dependency('libnuma', required: libnumaopt)
+if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt, dirs: test_lib_d)
+endif
+if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3183,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3821,6 +3837,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5ac..0bd4b2d7d32 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5aa..bff9f077a8c 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -54,6 +54,9 @@ ifeq ($(with_systemd),yes)
LIBS += -lsystemd
endif
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
+
override LDFLAGS := $(LDFLAGS) $(LDFLAGS_EX) $(LDFLAGS_EX_BE)
##########################################################################
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
--
2.39.5
v3-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch (application/octet-stream)
From e7239148d6dab4482e0f97958c077102295c31c7 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v3 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA zone
---
contrib/pg_buffercache/Makefile | 3 +-
contrib/pg_buffercache/meson.build | 1 +
.../pg_buffercache--1.5--1.6.sql | 30 +++++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 113 +++++++++++++++++-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/storage/pg_shmem.h | 1 +
7 files changed, 146 insertions(+), 6 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..9b2e9393410 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..e5b3d1f7dd2
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,30 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the function.
+DROP FUNCTION pg_buffercache_pages() CASCADE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean)
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages(false) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_pages(true) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..a5aab07fc99 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
* contrib/pg_buffercache/pg_buffercache_pages.c
*-------------------------------------------------------------------------
*/
+#include "pg_config.h"
#include "postgres.h"
#include "access/htup_details.h"
@@ -13,10 +14,16 @@
#include "funcapi.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#include <unistd.h>
+#endif
+#include "storage/pg_shmem.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +50,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -65,6 +73,15 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+static void
+pg_buffercache_mark_numa_invalid(BufferCachePagesContext *fctx, int n)
+{
+ int i;
+ for (i = 0; i < n; i++) {
+ fctx->record[i].numa_zone_id = -1;
+ }
+}
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -75,10 +92,16 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
HeapTuple tuple;
+ Buffer query_numa = PG_GETARG_BOOL(0);
if (SRF_IS_FIRSTCALL())
{
- int i;
+ int i, blk2page, j;
+ Size os_page_size;
+ void **os_page_ptrs;
+ int *os_pages_status;
+ int os_page_count;
+ float pages_per_blk;
funcctx = SRF_FIRSTCALL_INIT();
@@ -122,10 +145,14 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM-1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
fctx->tupdesc = BlessTupleDesc(tupledesc);
/* Allocate NBuffers worth of BufferCachePagesRec records. */
@@ -140,6 +167,30 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
/* Return to original context when allocating transient memory */
MemoryContextSwitchTo(oldcontext);
+ /* This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+ * allocated in various different OS memory pages sizes, so first
+ * we need to understand the OS memory page size before
+ * calling move_pages()
+ */
+ os_page_size = sysconf(_SC_PAGESIZE);
+ if(huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ os_page_count = (NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(NOTICE, "os_page_count=%d os_page_size=%ld pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably
+ * have bug in our buffers to OS page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
/*
* Scan through all the buffers, saving the relevant fields in the
* fctx->record structure.
@@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
else
fctx->record[i].isvalid = false;
+#ifdef USE_LIBNUMA
+/* FIXME: taken from bufmgr.c, maybe move to .h ? */
+#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+ blk2page = (int) i * pages_per_blk;
+ j = 0;
+ do {
+ /*
+ * Many buffers can point to the same page, but we want to
+ * query just first address.
+ *
+ * In order to get reliable results we also need to touch memory pages
+ * so that inquiry about NUMA zone doesn't return -2.
+ */
+ if(os_page_ptrs[blk2page+j] == 0) {
+ volatile uint64 touch pg_attribute_unused();
+ os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
+ touch = *(uint64 *)os_page_ptrs[blk2page+j];
+ }
+ j++;
+ } while(j < (int)pages_per_blk);
+#endif
+
UnlockBufHdr(bufHdr, buf_state);
}
+
+
+#ifdef USE_LIBNUMA
+ if(query_numa) {
+ /* According to numa(3) it is required to initialize library even if that's no-op. */
+ if(numa_available() == -1) {
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+ elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
+ } else {
+ /* Amortize the number of pages we need to query about */
+ if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
+ elog(ERROR, "failed NUMA pages inquiry status");
+ }
+ for (i = 0; i < NBuffers; i++) {
+ blk2page = (int) i * pages_per_blk;
+ /* Technically we can get errors too here and pass that to user
+ *
+ * XXX:: also we could somehow report single DB block spanning
+ * more than 2 NUMA zones, but it should be rare (?)
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
+ }
+ }
+ } else
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+#else
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+#endif
+
+ pfree(os_page_ptrs);
+ pfree(os_pages_status);
}
funcctx = SRF_PERCALL_SETUP();
@@ -211,6 +315,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
nulls[7] = true;
/* unused for v1.0 callers, but the array is always long enough */
nulls[8] = true;
+ nulls[9] = true;
}
else
{
@@ -231,6 +336,8 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
/* unused for v1.0 callers, but the array is always long enough */
values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
}
/* Build and return the tuple. */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 03a6dd49154..172309d389a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -562,7 +562,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
--
2.39.5
v3-0003-Add-pg_shmem_numa_allocations.patch (application/octet-stream)
From 995011841cde76e530e2ff12452f54e8b8da5923 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v3 3/3] Add pg_shmem_numa_allocations
---
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 120 +++++++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
3 files changed, 136 insertions(+)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index eff0990957e..c808fb82d75 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..c8881d98e05 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,11 @@
#include "storage/shmem.h"
#include "storage/spin.h"
#include "utils/builtins.h"
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#include <unistd.h>
+#endif
static void *ShmemAllocRaw(Size size, Size *allocated_size);
@@ -568,3 +573,118 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+//#ifdef LIBNUMA
+#if 1
+
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+
+ /* According to numa(3) it is required to initialize library even if that's no-op. */
+ if(numa_available() == -1) {
+ elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
+ return (Datum) 0;
+ }
+
+ /* This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+ * allocated in various different OS memory pages sizes, so first
+ * we need to understand the OS memory page size before
+ * calling move_pages()
+ */
+ os_page_size = sysconf(_SC_PAGESIZE);
+ if(huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ // MemoryContext!
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+ void **os_page_ptrs;
+ int *os_pages_status;
+ int os_page_count;
+#define MAX_ZONES 32 /* FIXME? */
+ Size zones[MAX_ZONES];
+
+ os_page_count = ent->allocated_size / os_page_size;
+ //elog(NOTICE, "os_page_count=%d os_page_size=%ld ", os_page_count, os_page_size);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably
+ * have bug in our buffers to OS page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ for(i = 0; i < os_page_count; i++) {
+ /*
+ * In order to get reliable results we also need to touch memory pages
+ * so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+ os_page_ptrs[i] = (char *)ent->location + (i * os_page_size);
+ touch = *(uint64 *)os_page_ptrs[i];
+ }
+
+ /* Amortize the number of pages we need to query about */
+ if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
+ elog(ERROR, "failed NUMA pages inquiry status");
+ }
+
+ memset(zones, 0, sizeof(zones));
+ /* Counter number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < os_page_count; i++) {
+ int s = os_pages_status[i];
+ if(s >= 0)
+ zones[s]++;
+ }
+
+ pfree(os_page_ptrs);
+ pfree(os_pages_status);
+
+ for(i = 0; i <= numa_max_node(); i++){
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = i;
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /* XXX: We are ignoring reporting the following regions in pg_get_shmem_allocations() case:
+ * - output shared memory allocated but not counted via the shmem index
+ * - output as-of-yet unused shared memory
+ */
+
+ LWLockRelease(ShmemIndexLock);
+#else
+ ereport(WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("NUMA support is not available"),
+ errdetail("NUMA zone information is not available on this platform due to lack of libnuma"),
+ errhint("It looks like you need to re-compile with libnuma packages available")));
+#endif
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9e803d610d7..1efa342b725 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8463,6 +8463,14 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+# shared memory usage with NUMA info
+{ oid => '5101', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
--
2.39.5
Hi,
On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:
On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
OTOH, one of the main issues that I see with 3. is that the monitoring could
probably influence the kernel's decision to start page migration (I'm not 100%
sure but I could imagine it may influence the kernel's decision due to having to
read/touch the pages).
But I'm thinking: do we really need to know the page location of every single page?
I think what we want to see is if the pages are "equally" distributed on all
the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
or contiguous chunks). We could create a new function that would accept a
"sampling distance" as parameter for example, thoughts?
The way I envision it (and I think what Andres wanted, not sure, still
yet to see him comment on all of this) is to give PG devs a way to
quickly spot NUMA imbalances, even for single relation.
Yea. E.g. for some benchmark workloads the difference whether the root btree
page is on the same NUMA node as the workload or not makes a roughly 2x perf
difference. It's really hard to determine that today.
If there would be agreement that this is the way we want to have it
(from the backend and not from checkpointer), here's what's left on
the table to be done here:
a. isn't there something quicker for touching / page-faulting memory ?
If you actually fault in a page the kernel actually has to allocate memory and
then zero it out. That rather severely limits the throughput...
If not then maybe add CHECK_FOR_INTERRUPTS() there?
Should definitely be there.
BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
help (it probably only works for parent/postmaster).
Yes, needs to be in postmaster.
Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?
FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328
Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.
b. refactor shared code so that it goes into src/port (but with
Linux-only support so far)
c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
You mean a specific context instead of CurrentMemoryContext?
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2e..e3b7554d9e8 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -436,6 +436,10 @@ task:
SANITIZER_FLAGS: -fsanitize=address
PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
+
+ # FIXME: use or not the libnuma?
+ # --with-libnuma \
+ #
# Normally, the "relation segment" code basically has no coverage in our
# tests, because we (quite reasonably) don't generate tables large
# enough in tests. We've had plenty bugs that we didn't notice due the
I don't see why not.
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..e5b3d1f7dd2
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,30 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the function.
+DROP FUNCTION pg_buffercache_pages() CASCADE;
Why? I think that's going to cause problems, as the pg_buffercache view
depends on it, and user views might in turn depend on pg_buffercache. I think
CASCADE is rarely, if ever, ok to use in an extension script.
+CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean)
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages(false) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_pages(true) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
We grant pg_monitor SELECT on pg_buffercache, I think we should do the same
for _numa?
@@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
else
fctx->record[i].isvalid = false;
+#ifdef USE_LIBNUMA
+/* FIXME: taken from bufmgr.c, maybe move to .h ? */
+#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+ blk2page = (int) i * pages_per_blk;
BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.
+ j = 0;
+ do {
+ /*
+ * Many buffers can point to the same page, but we want to
+ * query just first address.
+ *
+ * In order to get reliable results we also need to touch memory pages
+ * so that inquiry about NUMA zone doesn't return -2.
+ */
+ if(os_page_ptrs[blk2page+j] == 0) {
+ volatile uint64 touch pg_attribute_unused();
+ os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
+ touch = *(uint64 *)os_page_ptrs[blk2page+j];
+ }
+ j++;
+ } while(j < (int)pages_per_blk);
+#endif
+
Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.
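One way to read the suggestion to defer the touching: run the inquiry first, touch only the pages that came back as -2 (-ENOENT, page not present), and repeat the inquiry only if anything was touched. A minimal sketch of the middle step, assuming a hypothetical helper touch_missing_pages() that is not in the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): after a first query, touch
 * only the pages reported as -ENOENT (not present in this process) and
 * return how many were touched, so the caller knows whether a second
 * query is worthwhile.
 */
static size_t
touch_missing_pages(void **pages, const int *status, size_t npages)
{
	size_t		i;
	size_t		touched = 0;

	assert(pages != NULL && status != NULL);

	for (i = 0; i < npages; i++)
	{
		if (status[i] == -ENOENT)
		{
			volatile uint64_t touch = *(volatile uint64_t *) pages[i];

			(void) touch;
			touched++;
		}
	}
	return touched;
}
```

When every page is already mapped in the backend the function returns 0 and no memory is read at all, so the common case pays only for the status scan.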
+#ifdef USE_LIBNUMA
+ if(query_numa) {
+ /* According to numa(3) it is required to initialize library even if that's no-op. */
+ if(numa_available() == -1) {
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+ elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
+ } else {
+ /* Amortize the number of pages we need to query about */
+ if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
+ elog(ERROR, "failed NUMA pages inquiry status");
+ }
I wonder if we ought to override numa_error() so we can display more useful
errors.
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
Doing multiple memory allocations while holding an lwlock is probably not a
great idea, even if the lock normally isn't contended.
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
Why do this in every loop iteration?
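The two review points above (no allocations under the lwlock, no allocation per loop iteration) can be sketched as a scratch area that is sized once for the worst case before the lock is taken, and merely reset per entry. This is a standalone illustration under assumptions: PageScratch and the scratch_* functions are hypothetical names, and malloc() stands in for palloc().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scratch area reused for every shared memory entry. */
typedef struct
{
    void      **page_ptrs;
    int        *status;
    size_t      capacity;       /* number of page slots available */
} PageScratch;

/* Allocate once, sized for the whole segment; call before taking any lock. */
int
scratch_init(PageScratch *s, size_t total_bytes, size_t os_page_size)
{
    assert(os_page_size > 0);

    s->capacity = total_bytes / os_page_size + 1;
    s->page_ptrs = malloc(s->capacity * sizeof(void *));
    s->status = malloc(s->capacity * sizeof(int));
    if (s->page_ptrs == NULL || s->status == NULL)
    {
        free(s->page_ptrs);
        free(s->status);
        return -1;
    }
    return 0;
}

/* Per-entry reset: no allocation, so it is cheap to do while holding a lock. */
int
scratch_reset(PageScratch *s, size_t n)
{
    if (n > s->capacity)
        return -1;
    memset(s->page_ptrs, 0, n * sizeof(void *));
    memset(s->status, 0xff, n * sizeof(int));   /* every status becomes -1 */
    return 0;
}

void
scratch_free(PageScratch *s)
{
    free(s->page_ptrs);
    free(s->status);
    s->page_ptrs = NULL;
    s->status = NULL;
    s->capacity = 0;
}
```

The trade-off is a larger up-front allocation in exchange for a loop body that only resets memory.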
Greetings,
Andres Freund
Hi,
On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:
Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?
That's a good point and from what I can see it's correct with huge pages being
used (it means all processes see the same NUMA node assignment regardless of
access patterns).
That said, wouldn't that be too strong to impose a restriction that huge_pages
must be enabled?
Jakub, thanks for the new patch version! FWIW, I have not looked closely at the
code yet (I just made the minor changes already shared, to get valid results with
a non-tiny shared buffer size). I'll look closely at the code for sure once we all
agree on the design part of it.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:
Hi Andres, thanks for your review!
OK first sane version attached with new src/port/pg_numa.c boilerplate
in 0001. Fixed some bugs too, there is one remaining optimization to
be done (see that `static` question later). Docs/tests are still
missing.
QQ: I'm still wondering if there's a better way of exposing multiple
pg shmem entries pointing to the same page (think of something hot:
PROCLOCK or ProcArray), so wouldn't it make sense (in some future
thread/patch) to expose this info somehow via an additional column
(pg_get_shmem_numa_allocations.shared_pages bool?)? I'm thinking of
an easy way of showing that potential NUMA auto balancing could lead
to TLB NUMA shootdowns (not that it is happening, or counting them, just
identifying it as a problem in allocation). Or does that not make
sense, as we already have pg_shmem_allocations.{off|size} and we could
derive such info from it (after all, it is for devs)?
postgres@postgres:1234 : 18843 # select
name,off,off+allocated_size,allocated_size from pg_shmem_allocations
order by off;
          name          |    off    | ?column?  | allocated_size
------------------------+-----------+-----------+----------------
[..]
 Proc Header            | 147114112 | 147114240 |            128
 Proc Array             | 147274752 | 147275392 |            640
 KnownAssignedXids      | 147275392 | 147310848 |          35456
 KnownAssignedXidsValid | 147310848 | 147319808 |           8960
 Backend Status Array   | 147319808 | 147381248 |          61440
postgres@postgres:1234 : 18924 # select * from
pg_shmem_numa_allocations where name IN ('Proc Header', 'Proc Array',
'KnownAssignedXids', '..') order by name;
name | numa_zone_id | numa_size
-------------------+--------------+-----------
KnownAssignedXids | 0 | 2097152
Proc Array | 0 | 2097152
Proc Header | 0 | 2097152
I.e. ProcArray ends and KnownAssignedXids starts right afterwards; both
are hot, but they sit on the same huge page and NUMA node.
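Deriving the "shared page" information from pg_shmem_allocations.{off,allocated_size} could look roughly like the sketch below. It is only an illustration: ShmemAlloc, first_page, last_page and shares_page are hypothetical names, and it assumes that offsets are relative to a segment base that is itself aligned to the OS page size in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of one pg_shmem_allocations row. */
typedef struct
{
    const char *name;
    uint64_t    off;
    uint64_t    allocated_size;
} ShmemAlloc;

/* Index of the OS page holding the first byte of the allocation. */
uint64_t
first_page(const ShmemAlloc *a, uint64_t page_size)
{
    assert(page_size > 0);
    return a->off / page_size;
}

/* Index of the OS page holding the last byte of the allocation. */
uint64_t
last_page(const ShmemAlloc *a, uint64_t page_size)
{
    uint64_t    size = a->allocated_size > 0 ? a->allocated_size : 1;

    assert(page_size > 0);
    return (a->off + size - 1) / page_size;
}

/* True if the two allocations have at least one OS page in common. */
bool
shares_page(const ShmemAlloc *a, const ShmemAlloc *b, uint64_t page_size)
{
    return first_page(a, page_size) <= last_page(b, page_size) &&
        first_page(b, page_size) <= last_page(a, page_size);
}
```

With the offsets listed above, Proc Array and KnownAssignedXids land on the same page both for 2MB and for 4kB pages, while Proc Header and Proc Array share a page only in the 2MB case.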
If there would be agreement that this is the way we want to have it
(from the backend and not from checkpointer), here's what's left on
the table to be done here:
a. isn't there something quicker for touching / page-faulting memory ?
If you actually fault in a page the kernel actually has to allocate memory and
then zero it out. That rather severely limits the throughput...
OK, no comments about that madvise(MADV_POPULATE_READ), so I'm
sticking to pointers.
If not then maybe add CHECK_FOR_INTERRUPTS() there?
Should definitely be there.
Added.
BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
help (it probably only works for parent/postmaster).
Yes, needs to be in postmaster.
Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?
Please see attached file for more verbose results, but in short it is
like below:
patch(-touchpages) hugepages=off INVALID RESULTS (-2)
patch(-touchpages) hugepages=on INVALID RESULTS (-2)
patch(touchpages) hugepages=off CORRECT RESULT
patch(touchpages) hugepages=on CORRECT RESULT
patch(-touchpages)+MAP_POPULATE hugepages=off INVALID RESULTS (-2)
patch(-touchpages)+MAP_POPULATE hugepages=on INVALID RESULTS (-2)
IMHVO, the only other thing that could work here (but still
page-faulting) is that 5.14+ madvise(MADV_POPULATE_READ). Tests are
welcome, maybe it might be kernel version dependent.
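For anyone who wants to test the madvise(MADV_POPULATE_READ) idea, a minimal sketch follows. It is an assumption-laden illustration, not part of the patch: prefault_for_read is a hypothetical helper, the advice is only attempted when the system headers define MADV_POPULATE_READ, and it falls back to reading one byte per page when the call fails (for example with EINVAL on kernels older than 5.14). Whether this actually avoids the -2 results from the NUMA inquiry is exactly what would have to be measured.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Prefault [addr, addr + len) for reading.
 * Returns 1 if madvise(MADV_POPULATE_READ) did the work,
 * 0 if the one-read-per-page fallback loop was used instead.
 */
int
prefault_for_read(void *addr, size_t len, size_t page_size)
{
    assert(page_size > 0);

#ifdef MADV_POPULATE_READ
    if (madvise(addr, len, MADV_POPULATE_READ) == 0)
        return 1;
    /* Unsupported or failed: fall through to touching pages by hand. */
#endif

    for (size_t off = 0; off < len; off += page_size)
    {
        volatile unsigned char sink = *((volatile unsigned char *) addr + off);

        (void) sink;
    }
    return 0;
}
```

A quick experiment would be to map a shared anonymous region, call prefault_for_read() on it from a forked child, and then compare the inquiry results with and without the call.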
BTW: and yes, you can "feel" the timing impact of
MAP_SHARED|MAP_POPULATE during startup, but it seems that in our case
child backends apparently don't come up with pre-faulted page attachments
across fork().
FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328
Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.
Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...
b. refactor shared code so that it goes into src/port (but with
Linux-only support so far)
Done.
c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
You mean a specific context instead of CurrentMemoryContext?
Yes, I had doubts earlier, but for now I'm going to leave it as it is.
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2e..e3b7554d9e8 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -436,6 +436,10 @@ task:
SANITIZER_FLAGS: -fsanitize=address
PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
+
+ # FIXME: use or not the libnuma?
+ # --with-libnuma \
+ #
# Normally, the "relation segment" code basically has no coverage in our
# tests, because we (quite reasonably) don't generate tables large
# enough in tests. We've had plenty bugs that we didn't notice due the
I don't see why not.
Fixed.
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..e5b3d1f7dd2
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,30 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the function.
+DROP FUNCTION pg_buffercache_pages() CASCADE;
Why? I think that's going to cause problems, as the pg_buffercache view
depends on it, and user views might in turn depend on pg_buffercache. I think
CASCADE is rarely, if ever, ok to use in an extension script.
... it's just me cutting corners :^), fixed now.
[..]
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same
for _numa?
Yup, fixed.
@@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
else
fctx->record[i].isvalid = false;
+#ifdef USE_LIBNUMA
+/* FIXME: taken from bufmgr.c, maybe move to .h ? */
+#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+ blk2page = (int) i * pages_per_blk;
BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.
Fixed, thanks I was looking for something like this! Is that +1 in v4 good?
+ j = 0;
+ do {
+ /*
+ * Many buffers can point to the same page, but we want to
+ * query just first address.
+ *
+ * In order to get reliable results we also need to touch memory pages
+ * so that inquiry about NUMA zone doesn't return -2.
+ */
+ if(os_page_ptrs[blk2page+j] == 0) {
+ volatile uint64 touch pg_attribute_unused();
+ os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
+ touch = *(uint64 *)os_page_ptrs[blk2page+j];
+ }
+ j++;
+ } while(j < (int)pages_per_blk);
+#endif
+
Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.
Not fixed yet: maybe we could even do a `static` with
`has_this_run_earlier` and just perform this work only once during the
first time?
+#ifdef USE_LIBNUMA
+ if(query_numa) {
+ /* According to numa(3) it is required to initialize library even if that's no-op. */
+ if(numa_available() == -1) {
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+ elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
+ } else {
+ /* Amortize the number of pages we need to query about */
+ if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
+ elog(ERROR, "failed NUMA pages inquiry status");
+ }
I wonder if we ought to override numa_error() so we can display more useful
errors.
Another question without an easy answer as I never hit this error from
numa_move_pages(), one gets invalid stuff in *os_pages_status instead.
BUT!: most of our patch just uses things that cannot fail as per
libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700
/sys` (because it's hard to unmount it) and then still most of numactl
stuff works as euid != 0, but numactl --hardware gets at least
"libnuma: Warning: Cannot parse distance information in sysfs:
Permission denied" or same story with numactl -C 678 date. So unless
we start way more heavy use of libnuma (not just for observability)
there's like no point in that right now (?) On the other hand, we can
just do a variadic elog() for that; I've put some code in, but I have no idea
if it works fine...
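For reference, numa(3) documents numa_error(char *where) and numa_warn(int number, char *fmt, ...) as library-internal functions that a program may override. A variadic override could be sketched as below. This is an illustration only: last_numa_message is a hypothetical buffer that stands in for handing the text to elog(), and the sketch is written so that it compiles without libnuma being present.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sink: where a backend would call elog(), we keep the text. */
char        last_numa_message[256];

/* Replaces libnuma's default numa_warn(), which prints to stderr. */
void
numa_warn(int number, char *fmt, ...)
{
    char        body[200];
    va_list     ap;

    /* Expand the printf-style arguments supplied by libnuma. */
    va_start(ap, fmt);
    vsnprintf(body, sizeof(body), fmt, ap);
    va_end(ap);

    snprintf(last_numa_message, sizeof(last_numa_message),
             "libnuma warning %d: %s", number, body);
}

/* Replaces libnuma's default numa_error(); errno describes the failure. */
void
numa_error(char *where)
{
    snprintf(last_numa_message, sizeof(last_numa_message),
             "libnuma error in %s: %s", where, strerror(errno));
}
```

In the backend the two snprintf() calls into last_numa_message would become elog(WARNING, ...) or similar; the severity to use is a design choice left open here.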
[..]
Doing multiple memory allocations while holding an lwlock is probably not a
great idea, even if the lock normally isn't contended.
[..]
Why do this in every loop iteration?
Both fixed.
-J.
Attachments:
v4-0003-Add-pg_shmem_numa_allocations.patch (application/octet-stream)
From 7be77950a0e29640204b77616e90a5137b33d154 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v4 3/3] Add pg_shmem_numa_allocations
---
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 127 +++++++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
3 files changed, 143 insertions(+)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index eff0990957e..c808fb82d75 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..dd84a41e3e8 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
#include "storage/shmem.h"
#include "storage/spin.h"
#include "utils/builtins.h"
+#include "port/pg_numa.h"
static void *ShmemAllocRaw(Size size, Size *allocated_size);
@@ -568,3 +569,129 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ return (Datum) 0;
+ }
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+ * various different OS memory pages sizes, so first we need to understand
+ * the OS memory page size before calling move_pages()
+ */
+ os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ /* Every 1GB of scanned memory we give process chance to respond */
+#define ONE_GIGABYTE (1024*1024*1024)
+ if ((i * os_page_size) % ONE_GIGABYTE == 0)
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1) {
+ /* FIXME: should we release LWlock and pfree here? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+ }
+
+ memset(zones, 0, sizeof(zones));
+ /* Count number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ if (s >= 0 && s < MAX_NUMA_ZONES)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= pg_numa_get_max_node(); i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int64GetDatum(i);
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+
+ }
+
+ /*
+ * XXX: We are ignoring in NUMA version reporting of the following regions
+ * (compare to pg_get_shmem_allocations() case): - output shared memory
+ * allocated but not counted via the shmem index - output as-of-yet unused
+ * shared memory
+ */
+
+ LWLockRelease(ShmemIndexLock);
+
+ pfree(page_ptrs);
+ pfree(pages_status);
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9e803d610d7..1efa342b725 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8463,6 +8463,14 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+# shared memory usage with NUMA info
+{ oid => '5101', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
--
2.39.5
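The per-entry aggregation done in pg_get_shmem_numa_allocations() (turning the per-page status array into bytes per NUMA node) can be illustrated in isolation. The sketch below is hypothetical and not taken from the patch: sum_bytes_per_node and MAX_NODES are illustrative names, with MAX_NODES mirroring the patch's MAX_NUMA_ZONES. Negative statuses (errors such as -2) and node ids beyond the array are counted as skipped instead of being used as an index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NODES 32            /* mirrors MAX_NUMA_ZONES; illustrative only */

/*
 * Sum page sizes per NUMA node from a per-page status array.
 * Returns the number of pages skipped because their status was negative
 * (an errno) or did not fit into the bytes[] array.
 */
size_t
sum_bytes_per_node(const int *status, size_t count, size_t os_page_size,
                   uint64_t bytes[MAX_NODES])
{
    size_t      skipped = 0;

    memset(bytes, 0, sizeof(uint64_t) * MAX_NODES);

    for (size_t i = 0; i < count; i++)
    {
        if (status[i] >= 0 && status[i] < MAX_NODES)
            bytes[status[i]] += os_page_size;
        else
            skipped++;
    }
    return skipped;
}
```

Reporting the skipped count separately keeps pages that were never faulted in from silently disappearing from the totals.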
v4-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch (application/octet-stream)
From 86a9f778afaeb08ca3e03e81fb49372f392745a7 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v4 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA zone
---
contrib/pg_buffercache/Makefile | 3 +-
contrib/pg_buffercache/meson.build | 1 +
.../pg_buffercache--1.5--1.6.sql | 35 ++++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 156 +++++++++++++++++-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/storage/pg_shmem.h | 1 +
7 files changed, 189 insertions(+), 11 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..9b2e9393410 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..448d08196f3
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,35 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean)
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages(false) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_pages(true) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages(boolean) TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..95f331dc193 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
* contrib/pg_buffercache/pg_buffercache_pages.c
*-------------------------------------------------------------------------
*/
+#include "pg_config.h"
#include "postgres.h"
#include "access/htup_details.h"
@@ -13,10 +14,12 @@
#include "funcapi.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +46,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -65,6 +69,17 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+static void
+pg_buffercache_mark_numa_invalid(BufferCachePagesContext *fctx, int n)
+{
+ int i;
+
+ for (i = 0; i < n; i++)
+ {
+ fctx->record[i].numa_zone_id = -1;
+ }
+}
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -75,14 +90,33 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
HeapTuple tuple;
+ bool query_numa = PG_GETARG_BOOL(0);
if (SRF_IS_FIRSTCALL())
{
- int i;
+ int i,
+ blk2page,
+ j;
+ Size os_page_size;
+ void **os_page_ptrs;
+ int *os_pages_status;
+ int os_page_count;
+ float pages_per_blk;
funcctx = SRF_FIRSTCALL_INIT();
- /* Switch context when allocating stuff to be used in later calls */
+ if (query_numa)
+ {
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;
+ }
+ }
+
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
/* Create a user function context for cross-call persistence */
@@ -122,10 +156,14 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
fctx->tupdesc = BlessTupleDesc(tupledesc);
/* Allocate NBuffers worth of BufferCachePagesRec records. */
@@ -137,9 +175,37 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
funcctx->max_calls = NBuffers;
funcctx->user_fctx = fctx;
- /* Return to original context when allocating transient memory */
+ /*
+ * Return to original context when allocating transient memory
+ */
MemoryContextSwitchTo(oldcontext);
+ /*
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+ * allocated in various different OS memory pages sizes, so first we
+ * need to understand the OS memory page size before calling
+ * move_pages()
+ */
+ os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA os_page_count=%d os_page_size=%ld pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
/*
* Scan through all the buffers, saving the relevant fields in the
* fctx->record structure.
@@ -171,14 +237,79 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
else
fctx->record[i].isdirty = false;
- /* Note if the buffer is valid, and has storage created */
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
fctx->record[i].isvalid = true;
else
fctx->record[i].isvalid = false;
+ if (query_numa)
+ {
+ blk2page = (int) i * pages_per_blk;
+ j = 0;
+ do
+ {
+ /*
+ * Many buffers can point to the same page (in case of
+ * BLCKSZ < 4kB), but we want to query just the first
+ * address.
+ *
+ * In order to get reliable results we also need to touch
+ * memory pages, so that inquiry about NUMA zone doesn't
+ * return -2.
+ */
+ if (os_page_ptrs[blk2page + j] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /*
+ * Buffer IDs start from 1, hence i + 1
+ */
+ os_page_ptrs[blk2page + j] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2page + j]);
+
+ /*
+ * Every 1GB of scanned memory we give process chance
+ * to respond
+ */
+#define ONE_GIGABYTE (1024*1024*1024)
+ if ((i * os_page_size) % ONE_GIGABYTE == 0)
+ CHECK_FOR_INTERRUPTS();
+ }
+ j++;
+ } while (j < (int) pages_per_blk);
+ }
+
UnlockBufHdr(bufHdr, buf_state);
}
+
+
+ if (query_numa)
+ {
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Technically we can get errors too here and pass that to
+ * user
+ *
+ * XXX:: also we could somehow report single DB block spanning
+ * more than 2 NUMA zones, but it should be rare (?)
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
+ }
+ }
+ else
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+
+ pfree(os_page_ptrs);
+ pfree(os_pages_status);
}
funcctx = SRF_PERCALL_SETUP();
@@ -209,8 +340,12 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
nulls[5] = true;
nulls[6] = true;
nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
nulls[8] = true;
+ nulls[9] = true;
}
else
{
@@ -228,9 +363,14 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
nulls[6] = false;
values[7] = Int16GetDatum(fctx->record[i].usagecount);
nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
}
/* Build and return the tuple. */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 03a6dd49154..172309d389a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -562,7 +562,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
--
2.39.5
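The buffer-to-OS-page mapping that pg_buffercache_pages() needs (BLCKSZ from 2kB to 32kB against 4kB or huge pages) can also be expressed with integer arithmetic only. The sketch below is a hypothetical alternative to the float pages_per_blk computation, not the patch's code: PageRange and buffer_os_pages are illustrative names, buf_index is 0-based, and the buffer pool is assumed to start on an OS page boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive range of OS page indexes covered by one database block. */
typedef struct
{
    uint64_t    first;
    uint64_t    last;
} PageRange;

/*
 * Map a 0-based buffer index to the OS pages it occupies, counted from the
 * start of the buffer pool (assumed to be aligned to os_page_size).
 */
PageRange
buffer_os_pages(uint64_t buf_index, uint64_t blcksz, uint64_t os_page_size)
{
    PageRange   r;
    uint64_t    start = buf_index * blcksz;

    assert(blcksz > 0 && os_page_size > 0);

    r.first = start / os_page_size;
    r.last = (start + blcksz - 1) / os_page_size;
    return r;
}
```

For BLCKSZ=8kB on 4kB pages every block covers two pages; on 2MB huge pages 256 consecutive blocks map to a single page, and for BLCKSZ=2kB on 4kB pages two blocks share one page. Integer math also avoids float rounding once the buffer index gets large.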
v4-0001-Add-optional-dependency-to-libnuma-for-basic-NUMA.patch (application/octet-stream)
From 9e8f5376545dd73cd58183bffafcac8df26602a9 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v4 1/3] Add optional dependency to libnuma for basic NUMA
awareness routines
---
.cirrus.tasks.yml | 1 +
configure.ac | 13 ++++
meson.build | 17 ++++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/Makefile | 3 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 41 +++++++++++++
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 121 +++++++++++++++++++++++++++++++++++++
12 files changed, 208 insertions(+)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2e..7467e029131 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -448,6 +448,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/meson.build b/meson.build
index 574f992ed49..cf9dead5d02 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,21 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+libnuma = dependency('libnuma', required: libnumaopt)
+if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt, dirs: test_lib_d)
+endif
+if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3183,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3821,6 +3837,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5ac..0bd4b2d7d32 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5aa..bff9f077a8c 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -54,6 +54,9 @@ ifeq ($(with_systemd),yes)
LIBS += -lsystemd
endif
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
+
override LDFLAGS := $(LDFLAGS) $(LDFLAGS_EX) $(LDFLAGS_EX_BE)
##########################################################################
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..7cdc81d1ed6
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines.
+ *
+ *
+ * Copyright (c) 2019-2025, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) (ptr)
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..a68a29d5414 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7ffbd4d88d2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..a9a7d2c964b
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,121 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+#include <unistd.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#ifndef FRONTEND
+/* FIXME not tested, might crash */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
--
2.39.5
On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:
Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?
That's a good point and from what I can see it's correct with huge pages being
used (it means all processes see the same NUMA node assignment regardless of
access patterns).
Hi Bertrand, please see that nearby thread. I've got quite the
opposite results: I need to page-fault the memory first or I get invalid
results ("-2"). What kernel version are you using? (I've tried it on two
6.10.x series kernels, virtualized in both cases; one was an EPYC [real
NUMA, but in a VM, so not real hardware]).
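To make the "-2" case concrete: with a NULL nodes array, move_pages(2) only reports where each page currently resides, and the status for a page that is not present comes back as -ENOENT, i.e. -2. Below is a minimal standalone sketch of that behaviour, not the patch's code. It assumes Linux, uses the raw syscall so it builds without libnuma, and the helper names (query_page_node, probe_fresh_page, page_is_absent) are made up for illustration. It writes to a private anonymous mapping to make sure the page is really backed by memory before the second kind of probe.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* True if a move_pages() status value means "page not present". */
static int
page_is_absent(int status)
{
    return status == -ENOENT;   /* -ENOENT is the "-2" seen in the thread */
}

/*
 * Ask the kernel which NUMA node backs the page at addr.
 * Returns 0 and stores the node (>= 0) or a negative errno in *status;
 * returns -1 if the syscall itself is unavailable or not permitted.
 */
static int
query_page_node(void *addr, int *status)
{
    void *pages[1] = {addr};
    int   st[1] = {-1};

    /* pid 0 = calling process; nodes == NULL = report only, move nothing */
    if (syscall(SYS_move_pages, 0, 1UL, pages, NULL, st, 0) != 0)
        return -1;
    *status = st[0];
    return 0;
}

/* Map one fresh page, optionally touch it, and query its NUMA status. */
static int
probe_fresh_page(int touch, int *status)
{
    long  pagesz = sysconf(_SC_PAGESIZE);
    char *p;
    int   rc;

    if (pagesz <= 0)
        return -1;
    p = mmap(NULL, (size_t) pagesz, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    if (touch)
        p[0] = 1;               /* page fault: the kernel now backs the page */
    rc = query_page_node(p, status);
    munmap(p, (size_t) pagesz);
    return rc;
}
```

Calling probe_fresh_page(0, &st) and then probe_fresh_page(1, &st) on a NUMA-capable kernel should show the difference between an untouched and a touched page. In containers or on kernels without page migration support the syscall may simply be refused, in which case the helper returns -1.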
That said, wouldn't it be too strong to impose a restriction that huge_pages
must be enabled?
Jakub, thanks for the new patch version! FWIW, I did not look closely at the
code yet (I just made the minor changes already shared, to get valid results
with a non-tiny shared buffer size). I'll look closely at the code for sure
once we all agree on the design part of it.
Cool, I think we are pretty close actually, but others might have
a different perspective.
-J.
Hi,
On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:
FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328
Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.
Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...
FYI, you can enable CI on a github repo, to see results without posting to the
list:
https://github.com/postgres/postgres/blob/master/src/tools/ci/README
Greetings,
Andres Freund
On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:
FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328
Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.
Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...
FYI, you can enable CI on a github repo, to see results without posting to the
list:
https://github.com/postgres/postgres/blob/master/src/tools/ci/README
Thanks, I'll take a look into it.
Meanwhile v5 is attached with slight changes to try to make cfbot happy:
1. fixed tests and added tiny copy-cat basic tests for the
pg_buffercache_numa and pg_shmem_numa_allocations views
2. win32 doesn't have sysconf(), so the page size is obtained differently there
No docs yet.
-J.
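For reference, item 2 boils down to picking the page-size call per platform. A minimal standalone sketch of that idea follows. It is an illustration only, the function name os_base_page_size is hypothetical, and it deliberately ignores huge pages, which the patch handles separately.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

/* Return the base (non-huge) OS memory page size in bytes. */
static size_t
os_base_page_size(void)
{
#ifdef _WIN32
    SYSTEM_INFO sysinfo;

    /* Windows has no sysconf(); GetSystemInfo() reports the page size */
    GetSystemInfo(&sysinfo);
    return (size_t) sysinfo.dwPageSize;
#else
    long sz = sysconf(_SC_PAGESIZE);

    assert(sz > 0);             /* sysconf() returns -1 on error */
    return (size_t) sz;
#endif
}
```

The result is typically 4096 on x86-64, but the code should not assume that; other architectures and configurations use larger base pages.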
Attachments:
v5-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch (application/octet-stream)
From 01cf3b0d1b0d3f3f484a489158bae9b04d1e27a7 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v5 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA zone
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache.out | 21 ++-
contrib/pg_buffercache/meson.build | 1 +
.../pg_buffercache--1.5--1.6.sql | 35 ++++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 154 +++++++++++++++++-
contrib/pg_buffercache/sql/pg_buffercache.sql | 10 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/storage/pg_shmem.h | 1 +
9 files changed, 216 insertions(+), 13 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out
index b745dc69eae..c189892f086 100644
--- a/contrib/pg_buffercache/expected/pg_buffercache.out
+++ b/contrib/pg_buffercache/expected/pg_buffercache.out
@@ -8,6 +8,15 @@ from pg_buffercache;
t
(1 row)
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -28,12 +37,16 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0;
SET ROLE pg_database_owner;
SELECT * FROM pg_buffercache;
ERROR: permission denied for view pg_buffercache
-SELECT * FROM pg_buffercache_pages() AS p (wrong int);
+SELECT * FROM pg_buffercache_pages(false) AS p (wrong int);
+ERROR: permission denied for function pg_buffercache_pages
+SELECT * FROM pg_buffercache_pages(true) AS p (wrong int);
ERROR: permission denied for function pg_buffercache_pages
SELECT * FROM pg_buffercache_summary();
ERROR: permission denied for function pg_buffercache_summary
SELECT * FROM pg_buffercache_usage_counts();
ERROR: permission denied for function pg_buffercache_usage_counts
+SELECT * FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
RESET role;
-- Check that pg_monitor is allowed to query view / function
SET ROLE pg_monitor;
@@ -55,3 +68,9 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
t
(1 row)
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..9b2e9393410 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..448d08196f3
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,35 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean)
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages(false) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_pages(true) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages(boolean) TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..dfe53eb8471 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
* contrib/pg_buffercache/pg_buffercache_pages.c
*-------------------------------------------------------------------------
*/
+#include "pg_config.h"
#include "postgres.h"
#include "access/htup_details.h"
@@ -13,10 +14,12 @@
#include "funcapi.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +46,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -65,6 +69,17 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+static void
+pg_buffercache_mark_numa_invalid(BufferCachePagesContext *fctx, int n)
+{
+ int i;
+
+ for (i = 0; i < n; i++)
+ {
+ fctx->record[i].numa_zone_id = -1;
+ }
+}
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -75,14 +90,33 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
HeapTuple tuple;
+ bool query_numa = PG_GETARG_BOOL(0);
if (SRF_IS_FIRSTCALL())
{
- int i;
+ int i,
+ blk2page,
+ j;
+ Size os_page_size;
+ void **os_page_ptrs;
+ int *os_pages_status;
+ int os_page_count;
+ float pages_per_blk;
funcctx = SRF_FIRSTCALL_INIT();
- /* Switch context when allocating stuff to be used in later calls */
+ if (query_numa)
+ {
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;
+ }
+ }
+
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
/* Create a user function context for cross-call persistence */
@@ -122,10 +156,14 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
fctx->tupdesc = BlessTupleDesc(tupledesc);
/* Allocate NBuffers worth of BufferCachePagesRec records. */
@@ -137,9 +175,35 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
funcctx->max_calls = NBuffers;
funcctx->user_fctx = fctx;
- /* Return to original context when allocating transient memory */
+ /*
+ * Return to original context when allocating transient memory
+ */
MemoryContextSwitchTo(oldcontext);
+ /*
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+ * allocated in various different OS memory pages sizes, so first we
+ * need to understand the OS memory page size before calling
+ * move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA os_page_count=%d os_page_size=%zu pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably have a
+ * bug in our buffer-to-OS-page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
/*
* Scan through all the buffers, saving the relevant fields in the
* fctx->record structure.
@@ -171,14 +235,79 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
else
fctx->record[i].isdirty = false;
- /* Note if the buffer is valid, and has storage created */
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
fctx->record[i].isvalid = true;
else
fctx->record[i].isvalid = false;
+ if (query_numa)
+ {
+ blk2page = (int) (i * pages_per_blk);
+ j = 0;
+ do
+ {
+ /*
+ * Several buffers can share one OS page (when BLCKSZ is
+ * smaller than the OS page size), so we record and query
+ * only the first address seen for each page.
+ *
+ * In order to get reliable results we also need to touch
+ * memory pages, so that inquiry about NUMA zone doesn't
+ * return -2.
+ */
+ if (os_page_ptrs[blk2page + j] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /*
+ * Buffer IDs are 1-based, hence i + 1 below
+ */
+ os_page_ptrs[blk2page + j] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2page + j]);
+
+ /*
+ * Every 1GB of scanned memory we give process chance
+ * to respond
+ */
+#define ONE_GIGABYTE (1024 * 1024 * 1024)
+ if ((i * os_page_size) % ONE_GIGABYTE == 0)
+ CHECK_FOR_INTERRUPTS();
+ }
+ j++;
+ } while (j < (int) pages_per_blk);
+ }
+
UnlockBufHdr(bufHdr, buf_state);
}
+
+
+ if (query_numa)
+ {
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ blk2page = (int) (i * pages_per_blk);
+
+ /*
+ * Technically we can get errors too here and pass that to
+ * user
+ *
+ * XXX: we could also somehow report a single DB block spanning
+ * more than one NUMA zone, but it should be rare (?)
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
+ }
+ }
+ else
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+
+ pfree(os_page_ptrs);
+ pfree(os_pages_status);
}
funcctx = SRF_PERCALL_SETUP();
@@ -209,8 +338,12 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
nulls[5] = true;
nulls[6] = true;
nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
nulls[8] = true;
+ nulls[9] = true;
}
else
{
@@ -228,9 +361,14 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
nulls[6] = false;
values[7] = Int16GetDatum(fctx->record[i].usagecount);
nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
}
/* Build and return the tuple. */
diff --git a/contrib/pg_buffercache/sql/pg_buffercache.sql b/contrib/pg_buffercache/sql/pg_buffercache.sql
index 944fbb1beae..96e513a7f98 100644
--- a/contrib/pg_buffercache/sql/pg_buffercache.sql
+++ b/contrib/pg_buffercache/sql/pg_buffercache.sql
@@ -5,6 +5,11 @@ select count(*) = (select setting::bigint
where name = 'shared_buffers')
from pg_buffercache;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -16,9 +21,11 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0;
-- having to create a dedicated user, use the pg_database_owner pseudo-role.
SET ROLE pg_database_owner;
SELECT * FROM pg_buffercache;
-SELECT * FROM pg_buffercache_pages() AS p (wrong int);
+SELECT * FROM pg_buffercache_pages(false) AS p (wrong int);
+SELECT * FROM pg_buffercache_pages(true) AS p (wrong int);
SELECT * FROM pg_buffercache_summary();
SELECT * FROM pg_buffercache_usage_counts();
+SELECT * FROM pg_buffercache_numa;
RESET role;
-- Check that pg_monitor is allowed to query view / function
@@ -26,3 +33,4 @@ SET ROLE pg_monitor;
SELECT count(*) > 0 FROM pg_buffercache;
SELECT buffers_used + buffers_unused > 0 FROM pg_buffercache_summary();
SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
+SELECT count(*) > 0 FROM pg_buffercache_numa;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 03a6dd49154..172309d389a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -562,7 +562,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
--
2.39.5
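The block-to-OS-page mapping that pg_buffercache_pages() performs in the patch above can be sketched standalone. This is only an illustration under stated assumptions: it uses integer arithmetic instead of the patch's float pages_per_blk, it assumes the buffer pool starts on an OS page boundary, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of OS pages needed to cover nbuffers blocks of blcksz bytes. */
static uint64_t
os_pages_for_buffers(uint64_t nbuffers, uint64_t blcksz, uint64_t os_page_size)
{
    assert(os_page_size > 0);
    /* round up so a partially covered trailing page is still counted */
    return (nbuffers * blcksz + os_page_size - 1) / os_page_size;
}

/* Index of the first OS page that holds buffer i (0-based). */
static uint64_t
first_os_page_of_buffer(uint64_t i, uint64_t blcksz, uint64_t os_page_size)
{
    assert(os_page_size > 0);
    return (i * blcksz) / os_page_size;
}

/* Number of OS pages that buffer i overlaps (always at least 1). */
static uint64_t
os_pages_spanned(uint64_t i, uint64_t blcksz, uint64_t os_page_size)
{
    uint64_t first;
    uint64_t last;

    assert(blcksz > 0 && os_page_size > 0);
    first = (i * blcksz) / os_page_size;
    last = (i * blcksz + blcksz - 1) / os_page_size;
    return last - first + 1;
}
```

With 16384 buffers of 8kB and 4kB OS pages this gives 32768 OS pages and two pages per block, matching the numbers in the NOTICE output at the top of the thread. With 2MB huge pages, 256 consecutive 8kB buffers share one OS page, so only one address per page needs to be queried.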
v5-0001-Add-optional-dependency-to-libnuma-for-basic-NUMA.patch (application/octet-stream)
From c548a256e73211dee506762f8efbddcacfc61faf Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v5 1/3] Add optional dependency to libnuma for basic NUMA
awareness routines
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 1 +
configure.ac | 13 ++++
meson.build | 17 +++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/Makefile | 3 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 42 +++++++++++
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 150 +++++++++++++++++++++++++++++++++++++
12 files changed, 238 insertions(+)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2e..7467e029131 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -448,6 +448,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/meson.build b/meson.build
index 574f992ed49..cf9dead5d02 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,21 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+libnuma = dependency('libnuma', required: libnumaopt)
+if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt, dirs: test_lib_d)
+endif
+if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3183,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3821,6 +3837,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5ac..0bd4b2d7d32 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5aa..bff9f077a8c 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -54,6 +54,9 @@ ifeq ($(with_systemd),yes)
LIBS += -lsystemd
endif
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
+
override LDFLAGS := $(LDFLAGS) $(LDFLAGS_EX) $(LDFLAGS_EX_BE)
##########################################################################
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..be85b16b0de
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines.
+ *
+ *
+ * Copyright (c) 2019-2025, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) (ptr)
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..a68a29d5414 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7ffbd4d88d2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..e94e68abe42
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+#include <unistd.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/* FIXME not tested, might crash */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif;
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
--
2.39.5
v5-0003-Add-pg_shmem_numa_allocations.patch
From 63b130c3af4de104e0f7e1e8fc2fc990fc71263a Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v5 3/3] Add pg_shmem_numa_allocations
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 125 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/privileges.out | 22 +++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/sql/privileges.sql | 7 +-
6 files changed, 170 insertions(+), 4 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index eff0990957e..c808fb82d75 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..48de096f0ea 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
#include "storage/shmem.h"
#include "storage/spin.h"
#include "utils/builtins.h"
+#include "port/pg_numa.h"
static void *ShmemAllocRaw(Size size, Size *allocated_size);
@@ -568,3 +569,127 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");;
+ return (Datum) 0;
+ }
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+ * various different OS memory pages sizes, so first we need to understand
+ * the OS memory page size before calling move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ /* Every 1GB of scanned memory we give process chance to respond */
+#define ONE_GIGABYTE 1024*1024*1024
+ if ((i * os_page_size) % ONE_GIGABYTE == 0)
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1) {
+ /* FIXME: should we release LWlock and pfree here? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+ }
+
+ memset(zones, 0, sizeof(zones));
+ /* Count number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ if (s >= 0)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= pg_numa_get_max_node(); i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = i;
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+
+ }
+
+ /*
+ * XXX: We are ignoring in NUMA version reporting of the following regions
+ * (compare to pg_get_shmem_allocations() case): - output shared memory
+ * allocated but not counted via the shmem index - output as-of-yet unused
+ * shared memory
+ */
+
+ LWLockRelease(ShmemIndexLock);
+
+ pfree(page_ptrs);
+ pfree(pages_status);
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9e803d610d7..1efa342b725 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8463,6 +8463,14 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+# shared memory usage with NUMA info
+{ oid => '5101', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 6b01313101b..637f61e6ccc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3097,8 +3097,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3114,6 +3114,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3127,6 +3133,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
@@ -3141,6 +3153,12 @@ SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
t
(1 row)
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
RESET ROLE;
-- clean up
DROP ROLE regress_readallstats;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5baba8d39ff..5cbaa856858 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_zone_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_zone_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 60e7443bf59..38e6909c38c 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1896,8 +1896,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1906,16 +1906,19 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
RESET ROLE;
-- clean up
--
2.39.5
Hi,
On Wed, Feb 26, 2025 at 02:05:59PM +0100, Jakub Wartak wrote:
On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:
FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328
Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.
Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...
FYI, you can enable CI on a github repo, to see results without posting to the
list:
https://github.com/postgres/postgres/blob/master/src/tools/ci/README
Thanks, I'll take a look into it.
Meanwhile v5 is attached with slight changes to try to make cfbot happy:
Thanks for the updated version!
FWIW, I had to make a few changes to get an error-free build with
autoconf or meson, both with and without the libnuma configure option.
Sharing here as .txt files:
v5-0004-configure-changes.txt: changes in configure, plus a check for numa.h
availability and a call to numa_available.
v5-0005-pg_numa.c-changes.txt: moves the <unistd.h> include outside of USE_LIBNUMA
because the file still uses sysconf() in the non-NUMA code path. Also
removes the stray ";" in "#endif;" in the non-NUMA code path.
v5-0006-meson.build-changes.txt.
Those apply on top of your v5.
Also, the pg_buffercache test fails without the libnuma configure option. Maybe
some tests should depend on the libnuma configure option.
I still have not looked closely at the code.
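To illustrate the <unistd.h> point, here is a minimal standalone sketch (not
the patch itself). It assumes a POSIX system; sketch_get_pagesize() is a
hypothetical stand-in for pg_numa_get_pagesize() and leaves out the huge page
handling. The only thing it shows is that sysconf() is needed in both build
variants, so its header cannot live inside the USE_LIBNUMA block:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>             /* sysconf(); needed with and without libnuma */

#ifdef USE_LIBNUMA
#include <numa.h>               /* only pulled in when libnuma is enabled */
#endif

/*
 * Hypothetical stand-in for pg_numa_get_pagesize(): returns the base OS page
 * size. Huge page detection is intentionally omitted from this sketch.
 */
size_t
sketch_get_pagesize(void)
{
    long        sz = sysconf(_SC_PAGESIZE);

    /* sysconf() returns -1 on error; a page size is always positive */
    assert(sz > 0);
    return (size_t) sz;
}
```

Building this once with -DUSE_LIBNUMA and once without should behave the same
as far as the page size lookup is concerned.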
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v5-0004-configure-changes.txt
From 6871e2d54390d58403bcc36873a5e6e7bf88ed25 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Wed, 26 Feb 2025 13:52:15 +0000
Subject: [PATCH v5 4/6] configure changes
---
configure | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 87 insertions(+)
diff --git a/configure b/configure
index 93fddd69981..23c33dd9971 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
--
2.34.1
v5-0005-pg_numa.c-changes.txt
From ca5449d7091ec724e270b77f5be21e189fa94314 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Wed, 26 Feb 2025 15:22:54 +0000
Subject: [PATCH v5 5/6] pg_numa.c changes
---
src/port/pg_numa.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
100.0% src/port/
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index e94e68abe42..3aa1c191f51 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -17,6 +17,7 @@
#include "postgres.h"
#include "port/pg_numa.h"
#include "storage/pg_shmem.h"
+#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif
@@ -31,7 +32,6 @@
#include <numa.h>
#include <numaif.h>
-#include <unistd.h>
/* libnuma requires initialization as per numa(3) on Linux */
int
@@ -141,7 +141,7 @@ pg_numa_get_pagesize(void)
SYSTEM_INFO sysinfo;
GetSystemInfo(&sysinfo);
os_page_size = sysinfo.dwPageSize;
-#endif;
+#endif
if (huge_pages_status == HUGE_PAGES_ON)
GetHugePageSize(&os_page_size, NULL);
return os_page_size;
--
2.34.1
v5-0006-meson.build-changes.txt
From 2de58b2bab5d3c49c285c6d3781923645469e6a2 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Wed, 26 Feb 2025 16:41:14 +0000
Subject: [PATCH v5 6/6] meson.build changes
---
meson.build | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/meson.build b/meson.build
index d2178c7d32e..f81092eb661 100644
--- a/meson.build
+++ b/meson.build
@@ -954,9 +954,9 @@ endif
###############################################################
libnumaopt = get_option('libnuma')
-libnuma = dependency('libnuma', required: libnumaopt)
+libnuma = dependency('numa', required: libnumaopt)
if not libnuma.found()
- libnuma = cc.find_library('numa', required: libnumaopt, dirs: test_lib_d)
+ libnuma = cc.find_library('numa', required: libnumaopt)
endif
if libnuma.found()
cdata.set('USE_LIBNUMA', 1)
--
2.34.1
Hi Jakub,
On Wed, Feb 26, 2025 at 09:48:41AM +0100, Jakub Wartak wrote:
On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:
Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?
That's a good point and from what I can see it's correct with huge pages being
used (it means all processes see the same NUMA node assignment regardless of
access patterns).
Hi Bertrand, please see that nearby thread. I've got quite the
opposite results. I need to page-fault the memory or I get invalid results
("-2"). What kernel version are you using? (I've tried it on two
6.10.x series kernels, virtualized in both cases; one was EPYC [real
NUMA, but a VM, so not real hardware])
Thanks for sharing your numbers!
It looks like, with huge pages (hp) enabled, the shared_buffers size plays a role.
1. With hp, shared_buffers 4GB:
huge_pages_status
-------------------
on
(1 row)
shared_buffers
----------------
4GB
(1 row)
NOTICE: os_page_count=2048 os_page_size=2097152 pages_per_blk=0.003906
numa_zone_id | count
--------------+--------
| 507618
0 | 1054
-2 | 15616
(3 rows)
2. With hp, shared_buffers 23GB:
huge_pages_status
-------------------
on
(1 row)
shared_buffers
----------------
23GB
(1 row)
NOTICE: os_page_count=11776 os_page_size=2097152 pages_per_blk=0.003906
numa_zone_id | count
--------------+---------
| 2997974
0 | 16682
(2 rows)
3. no hp, shared_buffers 23GB:
huge_pages_status
-------------------
off
(1 row)
shared_buffers
----------------
23GB
(1 row)
ERROR: extension "pg_buffercache" already exists
NOTICE: os_page_count=6029312 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
--------------+---------
| 2997975
-2 | 16482
1 | 199
(3 rows)
Maybe the kernel is making some decisions based on HugePages_Rsvd; I have
no idea. Anyway, there is little that we can do, and the "touchpages" patch
seems to provide "accurate" results.
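As a sanity check on the NOTICE lines in the three runs above, the figures can
be reproduced with plain arithmetic. The sketch below assumes the default 8kB
block size (consistent with pages_per_blk=2.000000 at 4kB pages); the
sketch_* names are made up for illustration and are not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192      /* assumed default PostgreSQL block size */

/* Number of OS pages needed to back shared_buffers of the given size */
uint64_t
sketch_os_page_count(uint64_t shared_buffers_bytes, uint64_t os_page_size)
{
    assert(os_page_size > 0);
    return shared_buffers_bytes / os_page_size;
}

/* OS pages per DB block; below 1.0 when one OS page holds many blocks */
double
sketch_pages_per_blk(uint64_t os_page_size)
{
    assert(os_page_size > 0);
    return (double) SKETCH_BLCKSZ / (double) os_page_size;
}

/* Number of buffers, i.e. what the per-node counts should add up to */
uint64_t
sketch_buffer_count(uint64_t shared_buffers_bytes)
{
    return shared_buffers_bytes / SKETCH_BLCKSZ;
}
```

With 2MB huge pages, 4GB gives 2048 OS pages and 23GB gives 11776; with 4kB
pages, 23GB gives 6029312. The per-row counts in each run also add up to
shared_buffers divided by the block size (524288 buffers for 4GB, 3014656 for
23GB), so no buffers are lost in the mapping, only reported under different
statuses.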
Regards,
--
Bertrand Drouvot
On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
[..]
Meanwhile v5 is attached with slight changes to try to make cfbot happy:
Thanks for the updated version!
FWIW, I had to make a few changes to get an error-free build with
autoconf or meson, both with and without the libnuma configure option.
Sharing here as .txt files:
Also, the pg_buffercache test fails without the libnuma configure option. Maybe
some tests should depend on the libnuma configure option.
[..]
Thank you so much for this, Bertrand!
I've applied those, played a little bit with configure and meson,
reproduced the test error, and fixed it by silencing that NOTICE in the
tests. So v6 is attached even before I get a chance to start using
that CI. I'm still waiting for some input and tests regarding that earlier
touchpages attempt, and docs are still missing...
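For anyone testing the touchpages behaviour, a rough sketch of how a status
array from a move_pages()-style inquiry could be tallied is shown below. It is
only an illustration under stated assumptions: SketchTally and
sketch_tally_status() are hypothetical names, the cap of 32 nodes just mirrors
MAX_NUMA_ZONES from the patch, and -2 is taken to be -ENOENT ("page is not
present" per move_pages(2) on Linux). Unlike the patch, it bounds-checks the
node id before indexing and counts the untouched pages separately:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_NODES 32     /* hypothetical cap, mirrors MAX_NUMA_ZONES */

typedef struct SketchTally
{
    size_t      per_node[SKETCH_MAX_NODES]; /* pages resident on each node */
    size_t      not_present;    /* status -2 (-ENOENT): never faulted in */
    size_t      other;          /* other errors or out-of-range node ids */
} SketchTally;

/* Classify each per-page status value into the tally */
void
sketch_tally_status(const int *status, size_t count, SketchTally *out)
{
    assert(status != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    for (size_t i = 0; i < count; i++)
    {
        int         s = status[i];

        if (s >= 0 && s < SKETCH_MAX_NODES)
            out->per_node[s]++;
        else if (s == -2)
            out->not_present++;
        else
            out->other++;
    }
}
```

A non-zero not_present count after the pages were supposedly touched would
suggest that the touching did not happen for those pages in this backend.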
-J.
Attachments:
v6-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones-.patch
From 968d02701cce8d0cb1260444ed661d3819a3427a Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v6 3/3] Add pg_shmem_numa_allocations to show NUMA zones for
shared memory allocations
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 125 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/privileges.out | 25 ++++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/sql/privileges.sql | 10 +-
6 files changed, 176 insertions(+), 4 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index eff0990957e..c808fb82d75 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..48de096f0ea 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
#include "storage/shmem.h"
#include "storage/spin.h"
#include "utils/builtins.h"
+#include "port/pg_numa.h"
static void *ShmemAllocRaw(Size size, Size *allocated_size);
@@ -568,3 +569,127 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");;
+ return (Datum) 0;
+ }
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+ * various different OS memory pages sizes, so first we need to understand
+ * the OS memory page size before calling move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ /* Every 1GB of scanned memory we give process chance to respond */
+#define ONE_GIGABYTE 1024*1024*1024
+ if ((i * os_page_size) % ONE_GIGABYTE == 0)
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1) {
+ /* FIXME: should we release LWlock and pfree here? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+ }
+
+ memset(zones, 0, sizeof(zones));
+ /* Count number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ if (s >= 0)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= pg_numa_get_max_node(); i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = i;
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+
+ }
+
+ /*
+ * XXX: We are ignoring in NUMA version reporting of the following regions
+ * (compare to pg_get_shmem_allocations() case): - output shared memory
+ * allocated but not counted via the shmem index - output as-of-yet unused
+ * shared memory
+ */
+
+ LWLockRelease(ShmemIndexLock);
+
+ pfree(page_ptrs);
+ pfree(pages_status);
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9e803d610d7..1efa342b725 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8463,6 +8463,14 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+# shared memory usage with NUMA info
+{ oid => '5101', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 6b01313101b..b20bbcf52b9 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3097,8 +3097,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3114,6 +3114,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3127,6 +3133,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
@@ -3141,6 +3153,15 @@ SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
t
(1 row)
+-- to ignore potenital NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
+RESET client_min_messages;
RESET ROLE;
-- clean up
DROP ROLE regress_readallstats;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5baba8d39ff..5cbaa856858 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_zone_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_zone_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 60e7443bf59..1cf20dfe153 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1896,8 +1896,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1906,16 +1906,22 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+RESET client_min_messages;
RESET ROLE;
-- clean up
--
2.39.5
Attachment: v6-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch (application/octet-stream)
From 7ad8be9e522a9abc95c81c51da332dfb3edc47fc Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v6 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA zone for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache.out | 30 +++-
contrib/pg_buffercache/meson.build | 1 +
.../pg_buffercache--1.5--1.6.sql | 35 ++++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 154 +++++++++++++++++-
contrib/pg_buffercache/sql/pg_buffercache.sql | 19 ++-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/storage/pg_shmem.h | 1 +
9 files changed, 234 insertions(+), 13 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out
index b745dc69eae..f34f137075e 100644
--- a/contrib/pg_buffercache/expected/pg_buffercache.out
+++ b/contrib/pg_buffercache/expected/pg_buffercache.out
@@ -8,6 +8,18 @@ from pg_buffercache;
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET client_min_messages;
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -28,12 +40,19 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0;
SET ROLE pg_database_owner;
SELECT * FROM pg_buffercache;
ERROR: permission denied for view pg_buffercache
-SELECT * FROM pg_buffercache_pages() AS p (wrong int);
+SELECT * FROM pg_buffercache_pages(false) AS p (wrong int);
+ERROR: permission denied for function pg_buffercache_pages
+SELECT * FROM pg_buffercache_pages(true) AS p (wrong int);
ERROR: permission denied for function pg_buffercache_pages
SELECT * FROM pg_buffercache_summary();
ERROR: permission denied for function pg_buffercache_summary
SELECT * FROM pg_buffercache_usage_counts();
ERROR: permission denied for function pg_buffercache_usage_counts
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT * FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET client_min_messages;
RESET role;
-- Check that pg_monitor is allowed to query view / function
SET ROLE pg_monitor;
@@ -55,3 +74,12 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET client_min_messages;
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..9b2e9393410 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..448d08196f3
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,35 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean)
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages(false) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_pages(true) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages(boolean) TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..dfe53eb8471 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
* contrib/pg_buffercache/pg_buffercache_pages.c
*-------------------------------------------------------------------------
*/
+#include "pg_config.h"
#include "postgres.h"
#include "access/htup_details.h"
@@ -13,10 +14,12 @@
#include "funcapi.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +46,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -65,6 +69,17 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+static void
+pg_buffercache_mark_numa_invalid(BufferCachePagesContext *fctx, int n)
+{
+ int i;
+
+ for (i = 0; i < n; i++)
+ {
+ fctx->record[i].numa_zone_id = -1;
+ }
+}
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -75,14 +90,33 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
HeapTuple tuple;
+ bool query_numa = PG_GETARG_BOOL(0);
if (SRF_IS_FIRSTCALL())
{
- int i;
+ int i,
+ blk2page,
+ j;
+ Size os_page_size;
+ void **os_page_ptrs;
+ int *os_pages_status;
+ int os_page_count;
+ float pages_per_blk;
funcctx = SRF_FIRSTCALL_INIT();
- /* Switch context when allocating stuff to be used in later calls */
+ if (query_numa)
+ {
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;
+ }
+ }
+
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
/* Create a user function context for cross-call persistence */
@@ -122,10 +156,14 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
fctx->tupdesc = BlessTupleDesc(tupledesc);
/* Allocate NBuffers worth of BufferCachePagesRec records. */
@@ -137,9 +175,35 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
funcctx->max_calls = NBuffers;
funcctx->user_fctx = fctx;
- /* Return to original context when allocating transient memory */
+ /*
+ * Return to original context when allocating transient memory
+ */
MemoryContextSwitchTo(oldcontext);
+ /*
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+ * allocated in various different OS memory pages sizes, so first we
+ * need to understand the OS memory page size before calling
+ * move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA os_page_count=%d os_page_size=%ld pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
/*
* Scan through all the buffers, saving the relevant fields in the
* fctx->record structure.
@@ -171,14 +235,79 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
else
fctx->record[i].isdirty = false;
- /* Note if the buffer is valid, and has storage created */
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
fctx->record[i].isvalid = true;
else
fctx->record[i].isvalid = false;
+ if (query_numa)
+ {
+ blk2page = (int) i * pages_per_blk;
+ j = 0;
+ do
+ {
+ /*
+ * Many buffers can point to the same page (in case of
+ * BLCKSZ < 4kB), but we only want to query its first
+ * address.
+ *
+ * In order to get reliable results we also need to touch
+ * memory pages, so that inquiry about NUMA zone doesn't
+ * return -2.
+ */
+ if (os_page_ptrs[blk2page + j] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /*
+ * Buffer IDs really start from 1
+ */
+ os_page_ptrs[blk2page + j] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2page + j]);
+
+ /*
+ * Every 1GB of scanned memory we give process chance
+ * to respond
+ */
+#define ONE_GIGABYTE 1024*1024*1024
+ if ((i * os_page_size) % ONE_GIGABYTE == 0)
+ CHECK_FOR_INTERRUPTS();
+ }
+ j++;
+ } while (j < (int) pages_per_blk);
+ }
+
UnlockBufHdr(bufHdr, buf_state);
}
+
+
+ if (query_numa)
+ {
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Technically we can get errors too here and pass that to
+ * user
+ *
+ * XXX:: also we could somehow report single DB block spanning
+ * more than 2 NUMA zones, but it should be rare (?)
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
+ }
+ }
+ else
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+
+ pfree(os_page_ptrs);
+ pfree(os_pages_status);
}
funcctx = SRF_PERCALL_SETUP();
@@ -209,8 +338,12 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
nulls[5] = true;
nulls[6] = true;
nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
nulls[8] = true;
+ nulls[9] = true;
}
else
{
@@ -228,9 +361,14 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
nulls[6] = false;
values[7] = Int16GetDatum(fctx->record[i].usagecount);
nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
}
/* Build and return the tuple. */
diff --git a/contrib/pg_buffercache/sql/pg_buffercache.sql b/contrib/pg_buffercache/sql/pg_buffercache.sql
index 944fbb1beae..7f2ce683e6c 100644
--- a/contrib/pg_buffercache/sql/pg_buffercache.sql
+++ b/contrib/pg_buffercache/sql/pg_buffercache.sql
@@ -5,6 +5,14 @@ select count(*) = (select setting::bigint
where name = 'shared_buffers')
from pg_buffercache;
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+RESET client_min_messages;
+
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -16,9 +24,14 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0;
-- having to create a dedicated user, use the pg_database_owner pseudo-role.
SET ROLE pg_database_owner;
SELECT * FROM pg_buffercache;
-SELECT * FROM pg_buffercache_pages() AS p (wrong int);
+SELECT * FROM pg_buffercache_pages(false) AS p (wrong int);
+SELECT * FROM pg_buffercache_pages(true) AS p (wrong int);
SELECT * FROM pg_buffercache_summary();
SELECT * FROM pg_buffercache_usage_counts();
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT * FROM pg_buffercache_numa;
+RESET client_min_messages;
RESET role;
-- Check that pg_monitor is allowed to query view / function
@@ -26,3 +39,7 @@ SET ROLE pg_monitor;
SELECT count(*) > 0 FROM pg_buffercache;
SELECT buffers_used + buffers_unused > 0 FROM pg_buffercache_summary();
SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET client_min_messages;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 03a6dd49154..172309d389a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -562,7 +562,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
--
2.39.5
Attachment: v6-0001-Add-optional-dependency-to-libnuma-for-basic-NUMA.patch (application/octet-stream)
From ae07555fdeb4a8d377ed44925d1ec4795ccf56ce Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v6 1/3] Add optional dependency to libnuma for basic NUMA
awareness routines and add a minimal src/port/pg_numa.c portability wrapper.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
---
.cirrus.tasks.yml | 1 +
configure | 87 +++++++++++++++++++++
configure.ac | 13 ++++
meson.build | 17 +++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/Makefile | 3 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 42 +++++++++++
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 150 +++++++++++++++++++++++++++++++++++++
13 files changed, 325 insertions(+)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2e..7467e029131 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -448,6 +448,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
diff --git a/configure b/configure
index 93fddd69981..23c33dd9971 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/meson.build b/meson.build
index 574f992ed49..7302c876b31 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,21 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+libnuma = dependency('numa', required: libnumaopt)
+if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+endif
+if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3183,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3821,6 +3837,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5ac..0bd4b2d7d32 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5aa..bff9f077a8c 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -54,6 +54,9 @@ ifeq ($(with_systemd),yes)
LIBS += -lsystemd
endif
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
+
override LDFLAGS := $(LDFLAGS) $(LDFLAGS_EX) $(LDFLAGS_EX_BE)
##########################################################################
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..be85b16b0de
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2019-2025, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..a68a29d5414 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7ffbd4d88d2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..3aa1c191f51
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/* FIXME not tested, might crash */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
--
2.39.5
Hi,
On Thu, Feb 27, 2025 at 10:05:46AM +0100, Jakub Wartak wrote:
On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
[..]
Meanwhile v5 is attached with slight changes to try to make cfbot happy:
Thanks for the updated version!
FWIW, I had to make a few changes to get an error-free build with
autoconf or meson, both with and without the libnuma configure option. Sharing here as .txt files:
Also the pg_buffercache test fails without the libnuma configure option. Maybe
some tests should depend on the libnuma configure option.
[..]
Thank you so much for this, Bertrand!
I've applied those, played a little with configure and meson,
reproduced the test error, and fixed it by silencing that NOTICE in
the tests. So v6 is attached even before I've had a chance to start using
that CI. I'm still waiting for some input and tests regarding that earlier
touchpages attempt, and docs are still missing...
Thanks for the new version!
I did some tests and it looks like it's giving correct results. I don't see -2
anymore and every backend reports the correct distribution (with or without
huge pages, with a "small" or "large" shared buffer).
A few random comments:
=== 1
+ /*
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+ * allocated in various different OS memory pages sizes, so first we
+ * need to understand the OS memory page size before calling
+ * move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA os_page_count=%d os_page_size=%ld pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
I think that if (query_numa) check should also wrap that entire section of code.
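For reference, the block-to-OS-page arithmetic in the hunk above can also be written with integer math only, which avoids the float pages_per_blk. This is just a sketch, not the patch's code: the helper names are made up, and it assumes the buffer pool starts on an OS page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: OS pages needed to back nbuffers blocks (rounded up). */
static inline uint64_t
os_page_count_for(uint64_t nbuffers, size_t blcksz, size_t os_page_size)
{
	assert(os_page_size > 0);
	return (nbuffers * blcksz + os_page_size - 1) / os_page_size;
}

/* Hypothetical helper: index of the first OS page holding block blkno. */
static inline uint64_t
first_os_page_of_block(uint64_t blkno, size_t blcksz, size_t os_page_size)
{
	assert(os_page_size > 0);
	return (blkno * blcksz) / os_page_size;
}

/*
 * Hypothetical helper: number of OS pages a block touches. This is 1 when
 * the OS page is at least as large as the block (e.g. 2MB huge pages) and
 * the pool is page-aligned.
 */
static inline uint64_t
os_pages_spanned_by_block(uint64_t blkno, size_t blcksz, size_t os_page_size)
{
	uint64_t	first = (blkno * blcksz) / os_page_size;
	uint64_t	last = (blkno * blcksz + blcksz - 1) / os_page_size;

	assert(os_page_size > 0);
	return last - first + 1;
}
```

With 8kB blocks and 4kB pages each block spans two OS pages; with 2MB huge pages, 256 consecutive blocks share one OS page.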
=== 2
+ if (query_numa)
+ {
+ blk2page = (int) i * pages_per_blk;
+ j = 0;
+ do
+ {
This check is done for every page. I wonder if it would not make sense
to create a brand new function for pg_buffercache_numa and just leave the
current pg_buffercache_pages() as it is. That said it would be great to avoid
code duplication as much as possible though, maybe using a shared
populate_buffercache_entry() or such helper function?
=== 3
+#define ONE_GIGABYTE 1024*1024*1024
+ if ((i * os_page_size) % ONE_GIGABYTE == 0)
+ CHECK_FOR_INTERRUPTS();
+ }
Did you observe a noticeable performance impact when calling CHECK_FOR_INTERRUPTS()
for every page instead? (I don't see any with a 30GB shared buffer). I have the
feeling that we could get rid of the "ONE_GIGABYTE" check.
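As a side note on the quoted macro: because it is not parenthesized, C precedence rules make `x % ONE_GIGABYTE` parse as `((x % 1024) * 1024) * 1024`, so the test is true for any offset that is a multiple of 1024, not only on 1GB boundaries. A small sketch (the _UNSAFE/_SAFE names and helper functions are invented here for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Same text as the quoted macro: expands without parentheses. */
#define ONE_GIGABYTE_UNSAFE 1024*1024*1024
/* Parenthesized and widened so the product cannot overflow int. */
#define ONE_GIGABYTE_SAFE ((uint64_t) 1024 * 1024 * 1024)

/* Parses as ((offset % 1024) * 1024) * 1024 == 0. */
static inline int
on_gb_boundary_unsafe(uint64_t offset)
{
	assert(offset <= UINT64_MAX);
	return offset % ONE_GIGABYTE_UNSAFE == 0;
}

/* Parses as offset % (1024 * 1024 * 1024) == 0. */
static inline int
on_gb_boundary_safe(uint64_t offset)
{
	return offset % ONE_GIGABYTE_SAFE == 0;
}
```

With 4kB or 2MB OS pages every page offset is a multiple of 1024, so the unparenthesized form would fire on every iteration, which may be why no difference was measurable.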
=== 4
+ pfree(os_page_ptrs);
+ pfree(os_pages_status);
Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).
=== 5
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
That's a good idea.
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
That sounds right.
Could we also avoid some code duplication with pg_get_shmem_allocations()?
Also same remarks about pfree() and ONE_GIGABYTE as above.
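To illustrate the page-touching idea from item 5 outside of PostgreSQL: a read through a volatile pointer cannot be optimized away, so reading one byte per OS page is enough to make the kernel fault the page in. The sketch below is an assumption-laden stand-in (touch_pages is a made-up name, and it reads a single byte where the patch's macro reads a uint64).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: read one byte from the start of every OS page in
 * [base, base + len). Returns the number of pages touched.
 */
static inline size_t
touch_pages(const void *base, size_t len, size_t os_page_size)
{
	const char *p = (const char *) base;
	size_t		touched = 0;
	volatile unsigned char sink = 0;

	assert(os_page_size > 0);

	for (size_t off = 0; off < len; off += os_page_size)
	{
		/* volatile access forces an actual load from memory */
		sink = *(const volatile unsigned char *) (p + off);
		touched++;
	}

	(void) sink;
	return touched;
}
```

Only after such a pass does a move_pages(2)-style inquiry report a node for the page instead of -2 (-ENOENT, page not present).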
A few other things:
==== 6
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
#include "storage/shmem.h"
#include "storage/spin.h"
#include "utils/builtins.h"
+#include "port/pg_numa.h"
Not at the right position, should be between those 2:
#include "miscadmin.h"
#include "storage/lwlock.h"
==== 7
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Miscellaneous functions for bit-wise operations.
description is not correct. Also the "Copyright (c) 2019-2025" might be
"Copyright (c) 2025" instead.
=== 8
+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * numa.c
+ * Basic NUMA portability routines
s/numa.c/pg_numa.c/ ?
=== 9
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
* contrib/pg_buffercache/pg_buffercache_pages.c
*-------------------------------------------------------------------------
*/
+#include "pg_config.h"
#include "postgres.h"
Is this new include needed?
#include "access/htup_details.h"
@@ -13,10 +14,12 @@
#include "funcapi.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
not in the right order.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote:
On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:
Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.
Not fixed yet: maybe we could even do a `static` with
`has_this_run_earlier` and just perform this work only once during the
first time?
Not sure I get your idea, could you share what the code would look like?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi!
On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
I did some tests and it looks like it's giving correct results. I don't see -2
anymore and every backend reports correct distribution (with or without hp,
with "small" or "large" shared buffer).
Cool! Attached is v7, which is fully green on the Cirrus CI that Andres
recommended; we will see how cfbot reacts to this. BTW docs are still
missing. When started with proper interleave=all, s_b=64GB,
hugepages and 4 NUMA nodes (1 socket with 4 CCDs), after a small pgbench run:
postgres=# select buffers_used, buffers_unused from pg_buffercache_summary();
buffers_used | buffers_unused
--------------+----------------
170853 | 8217755
(1 row)
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa group
by numa_zone_id order by numa_zone_id;
DEBUG: NUMA: os_page_count=32768 os_page_size=2097152 pages_per_blk=0.003906
numa_zone_id | count
--------------+---------
0 | 42752
1 | 42752
2 | 42752
3 | 42597
| 8217755
Time: 5828.100 ms (00:05.828)
postgres=# select 3*42752+42597;
?column?
----------
170853
postgres=# select * from pg_shmem_numa_allocations order by numa_size
desc limit 12;
DEBUG: NUMA: page-faulting shared memory segments for proper NUMA readouts
name | numa_zone_id | numa_size
--------------------+--------------+-------------
Buffer Blocks | 0 | 17179869184
Buffer Blocks | 1 | 17179869184
Buffer Blocks | 3 | 17179869184
Buffer Blocks | 2 | 17179869184
Buffer Descriptors | 2 | 134217728
Buffer Descriptors | 1 | 134217728
Buffer Descriptors | 0 | 134217728
Buffer Descriptors | 3 | 134217728
Checkpointer Data | 1 | 67108864
Checkpointer Data | 0 | 67108864
Checkpointer Data | 2 | 67108864
Checkpointer Data | 3 | 67108864
Time: 68.579 ms
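The numa_size column above is essentially "pages found on that node times OS page size". A rough sketch of that aggregation over a move_pages(2)-style status array follows; sum_bytes_per_node is a hypothetical name, and unlike the v7 code it also skips node ids that would fall outside the output array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: add up bytes per NUMA node. Negative entries
 * (e.g. -2 for a page that is not present) and node ids >= nnodes are
 * skipped. Returns the number of skipped pages.
 */
static inline size_t
sum_bytes_per_node(const int *status, size_t npages, size_t os_page_size,
				   uint64_t *bytes_per_node, int nnodes)
{
	size_t		skipped = 0;

	assert(nnodes > 0);

	for (int n = 0; n < nnodes; n++)
		bytes_per_node[n] = 0;

	for (size_t i = 0; i < npages; i++)
	{
		if (status[i] >= 0 && status[i] < nnodes)
			bytes_per_node[status[i]] += os_page_size;
		else
			skipped++;
	}

	return skipped;
}
```

The bounds check matters if the array is sized from a fixed constant rather than from the number of nodes actually reported by the system.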
A few random comments:
=== 1
[..]
I think that the if (query_numa) check should also wrap that entire section of code.
Done.
=== 2
+ if (query_numa)
+ {
+ blk2page = (int) i * pages_per_blk;
+ j = 0;
+ do
+ {
This check is done for every page. I wonder if it would not make sense
to create a brand new function for pg_buffercache_numa and just leave the
current pg_buffercache_pages() as it is. That said it would be great to avoid
code duplication as much as possible though, maybe using a shared
populate_buffercache_entry() or such helper function?
Well, I've made query_numa a parameter there simply to avoid that code
duplication in the first place, look at those TupleDescInitEntry()...
IMHO rarely anybody uses pg_buffercache, but we could add unlikely()
there maybe to hint compiler with some smaller routine to reduce
complexity of that main routine? (assuming NUMA inquiry is going to be
rare).
=== 3
+#define ONE_GIGABYTE 1024*1024*1024
+ if ((i * os_page_size) % ONE_GIGABYTE == 0)
+ CHECK_FOR_INTERRUPTS();
+ }
Did you observe a noticeable performance impact when calling CHECK_FOR_INTERRUPTS()
for every page instead? (I don't see any with a 30GB shared buffer). I have the
feeling that we could get rid of the "ONE_GIGABYTE" check.
You are right, and no: it was simply my premature optimization attempt.
On a closer look CFI apparently already has unlikely() inside and looks
really cheap, so yeah, I've removed that.
=== 4
+ pfree(os_page_ptrs);
+ pfree(os_pages_status);
Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).
Yes, I wanted to have it just for illustrative and stylish purposes,
but right, removed.
=== 5
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
[..]
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
That sounds right.
Could we also avoid some code duplication with pg_get_shmem_allocations()?
Not sure I understand: do you want to avoid code duplication between
pg_get_shmem_allocations() and pg_get_shmem_numa_allocations(), or between
pg_get_shmem_numa_allocations() and pg_buffercache_pages(query_numa =
true)?
Also same remarks about pfree() and ONE_GIGABYTE as above.
Fixed.
A few other things:
==== 6
+#include "port/pg_numa.h"
Not at the right position, should be between those 2:
#include "miscadmin.h"
#include "storage/lwlock.h"
Fixed.
==== 7
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Miscellaneous functions for bit-wise operations.
description is not correct. Also the "Copyright (c) 2019-2025" might be
"Copyright (c) 2025" instead.
Fixed.
=== 8
+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * numa.c
+ * Basic NUMA portability routines
s/numa.c/pg_numa.c/ ?
Fixed.
=== 9
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
* contrib/pg_buffercache/pg_buffercache_pages.c
*-------------------------------------------------------------------------
*/
+#include "pg_config.h"
#include "postgres.h"
Is this new include needed?
Removed, don't remember how it arrived here, must have been some
artifact of earlier attempts.
#include "access/htup_details.h"
@@ -13,10 +14,12 @@
#include "funcapi.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
not in the right order.
Fixed.
And also those from nearby message:
On Thu, Feb 27, 2025 at 4:42 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote:
On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:
Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.
Not fixed yet: maybe we could even do a `static` with
`has_this_run_earlier` and just perform this work only once during the
first time?
Not sure I get your idea, could you share what the code would look like?
Please see pg_buffercache_pages(), where I've just added a static bool
firstUseInBackend:
postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from
pg_buffercache_numa group by numa_zone_id;
DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
DEBUG: NUMA: page-faulting the buffercache for proper NUMA readouts
[..]
postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from
pg_buffercache_numa group by numa_zone_id;
DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
[..]
Same was done to the pg_get_shmem_numa_allocations.
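Stripped of PostgreSQL specifics, the firstUseInBackend idea is the usual function-local static flag: the expensive step runs on the first call in a process and is skipped afterwards. A minimal sketch with invented names (fault_in_memory and the counter stand in for the page-touching loop):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter standing in for "how often did we page-fault". */
static int	expensive_calls = 0;

/* Hypothetical stand-in for the loop that touches every page. */
static void
fault_in_memory(void)
{
	expensive_calls++;
}

/*
 * Runs the expensive priming step only on the first call in this process.
 * Returns how many times the priming step has run so far.
 */
static inline int
query_numa_primed_once(void)
{
	static bool first_use = true;

	assert(expensive_calls >= 0);

	if (first_use)
	{
		fault_in_memory();
		/* cleared only after the step succeeded */
		first_use = false;
	}

	return expensive_calls;
}
```

One design point worth noting: if the flag is cleared only at the very end of the function, as in v7, an error raised in between leaves it set and the next call touches the pages again, which seems like the safer behavior.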
Also CFbot/cirrus was getting:
[11:01:05.027] In file included from ../../src/include/postgres.h:49,
[11:01:05.027] from pg_buffercache_pages.c:10:
[11:01:05.027] pg_buffercache_pages.c: In function ‘pg_buffercache_pages’:
[11:01:05.027] pg_buffercache_pages.c:194:30: error: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘Size’ {aka ‘long long unsigned int’} [-Werror=format=]
Fixed with %zu (for size_t) instead of %ld.
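For context, %ld is only right where long happens to have the same width as size_t; on the failing target Size is 'long long unsigned int', so the portable C99 conversion is %zu. A tiny sketch using plain snprintf (format_page_size is a made-up helper, not code from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format an OS page size. %zu matches size_t on every
 * platform, whatever the width of long is.
 */
static inline int
format_page_size(char *buf, size_t buflen, size_t os_page_size)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "os_page_size=%zu", os_page_size);
}
```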
Linux - Debian Bookworm - Autoconf got:
[10:42:59.216] checking numa.h usability... no
[10:42:59.268] checking numa.h presence... no
[10:42:59.286] checking for numa.h... no
[10:42:59.286] configure: error: header file <numa.h> is required for --with-libnuma
I've added libnuma1 and libnuma-dev to cirrus, in a similar vein to libcurl, to avoid this.
[13:50:47.449] gcc -m32 @src/backend/postgres.rsp
[13:50:47.449] /usr/bin/ld: /usr/lib/x86_64-linux-gnu/libnuma.so: error adding symbols: file in wrong format
I've also got an error in the 32-bit build: numa.h is there, but
apparently libnuma is provided only for x86_64. Anyway 32-bit (even with PAE)
and NUMA doesn't seem to make sense, so I've added -Dlibnuma=disabled for
such build.
-J.
Attachments:
v7-0001-Add-optional-dependency-to-libnuma-Linux-only-for.patch
From 9795cf53205c6fb1be9ae5364664d6ba7b2baa65 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v7 1/3] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper.
Other platforms can be added later.
libnuma is unavailable on 32-bit builds, so due to lack of an i386 shared object,
we disable it there (it does not make sense anyway, as i386 is very
memory-limited even with PAE)
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 7 +-
configure | 87 +++++++++++++++++++++
configure.ac | 13 ++++
meson.build | 17 +++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/Makefile | 3 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 43 +++++++++++
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 150 +++++++++++++++++++++++++++++++++++++
13 files changed, 331 insertions(+), 1 deletion(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2..584e3e5a44 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -428,6 +428,8 @@ task:
DEBIAN_FRONTEND=noninteractive apt-get -y install \
libcurl4-openssl-dev \
libcurl4-openssl-dev:i386 \
+ libnuma1 \
+ libnuma-dev
matrix:
- name: Linux - Debian Bookworm - Autoconf
@@ -448,6 +450,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
@@ -492,6 +495,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -804,7 +808,8 @@ task:
setup_additional_packages_script: |
apt-get update
- DEBIAN_FRONTEND=noninteractive apt-get -y install libcurl4-openssl-dev
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libcurl4-openssl-dev \
+ libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 93fddd6998..23c33dd997 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc..1a394dfc07 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/meson.build b/meson.build
index 13c13748e5..f81092eb66 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,21 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+libnuma = dependency('numa', required: libnumaopt)
+if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+endif
+if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3183,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3823,6 +3839,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c451714..adaadb5faf 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5a..0bd4b2d7d3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5a..bff9f077a8 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -54,6 +54,9 @@ ifeq ($(with_systemd),yes)
LIBS += -lsystemd
endif
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
+
override LDFLAGS := $(LDFLAGS) $(LDFLAGS_EX) $(LDFLAGS_EX_BE)
##########################################################################
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d..8894f80060 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 0000000000..d3ebe8b5bd
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d5023..f786c19160 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..a68a29d541 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d4..7ffbd4d88d 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 0000000000..db28578bca
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/* FIXME not tested, might crash */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
--
2.39.5
v7-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones-.patch
From 977c18b0ad8abfcd81a557c9559ae911f4752aac Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v7 3/3] Add pg_shmem_numa_allocations to show NUMA zones for
shared memory allocations
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 125 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/privileges.out | 25 ++++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/sql/privileges.sql | 10 +-
6 files changed, 176 insertions(+), 4 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf..cc014a62dc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..7d83a14390 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -568,3 +569,127 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count;
+ static bool firstUseInBackend = true;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ return (Datum) 0;
+ }
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+ * various different OS memory pages sizes, so first we need to understand
+ * the OS memory page size before calling move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ if (firstUseInBackend == true)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstUseInBackend == true)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ {
+ /* FIXME: should we release LWlock here ? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+ }
+
+ memset(zones, 0, sizeof(zones));
+ /* Count number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ if (s >= 0)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= pg_numa_get_max_node(); i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int64GetDatum(i);
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+
+ }
+
+ /*
+ * XXX: Unlike pg_get_shmem_allocations(), this NUMA version does not
+ * report the following regions:
+ * 1. shared memory allocated but not counted via the shmem index,
+ * 2. as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstUseInBackend = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index cd9422d0ba..62f051c194 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8463,6 +8463,14 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+# shared memory usage with NUMA info
+{ oid => '5101', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index a76256405f..02997690e1 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
@@ -3171,6 +3183,15 @@ SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
+RESET client_min_messages;
RESET ROLE;
-- clean up
DROP ROLE regress_readallstats;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b..b63c6e0f74 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_zone_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_zone_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index d195aaf137..e969cc3854 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,16 +1921,22 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+RESET client_min_messages;
RESET ROLE;
-- clean up
--
2.39.5
v7-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch (application/octet-stream)
From 77149659f37dd8943a85ab5cf61c96c6cc9dcebd Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v7 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA zone for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache.out | 30 +++-
contrib/pg_buffercache/meson.build | 1 +
.../pg_buffercache--1.5--1.6.sql | 35 ++++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 156 +++++++++++++++++-
contrib/pg_buffercache/sql/pg_buffercache.sql | 19 ++-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/storage/pg_shmem.h | 1 +
9 files changed, 237 insertions(+), 12 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e..2a33602537 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out
index b745dc69ea..f34f137075 100644
--- a/contrib/pg_buffercache/expected/pg_buffercache.out
+++ b/contrib/pg_buffercache/expected/pg_buffercache.out
@@ -8,6 +8,18 @@ from pg_buffercache;
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET client_min_messages;
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -28,12 +40,19 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0;
SET ROLE pg_database_owner;
SELECT * FROM pg_buffercache;
ERROR: permission denied for view pg_buffercache
-SELECT * FROM pg_buffercache_pages() AS p (wrong int);
+SELECT * FROM pg_buffercache_pages(false) AS p (wrong int);
+ERROR: permission denied for function pg_buffercache_pages
+SELECT * FROM pg_buffercache_pages(true) AS p (wrong int);
ERROR: permission denied for function pg_buffercache_pages
SELECT * FROM pg_buffercache_summary();
ERROR: permission denied for function pg_buffercache_summary
SELECT * FROM pg_buffercache_usage_counts();
ERROR: permission denied for function pg_buffercache_usage_counts
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT * FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET client_min_messages;
RESET role;
-- Check that pg_monitor is allowed to query view / function
SET ROLE pg_monitor;
@@ -55,3 +74,12 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET client_min_messages;
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe4871..9b2e939341 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 0000000000..448d08196f
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,35 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean)
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages(false) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_pages(true) AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages(boolean) TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77d..b030ba3a6f 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e1..f32546fdee 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,14 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +45,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -65,6 +68,52 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+static void
+pg_buffercache_mark_numa_invalid(BufferCachePagesContext *fctx, int n)
+{
+ int i;
+
+ for (i = 0; i < n; i++)
+ {
+ fctx->record[i].numa_zone_id = -1;
+ }
+}
+
+/*
+* Several buffers can share one OS memory page (when BLCKSZ
+* is smaller than the OS page size), so each OS page address
+* is recorded, and touched, only once.
+*
+* In order to get reliable results we also need to touch
+* memory pages, so that inquiry about NUMA zone doesn't
+* return -2 (-ENOENT, page not present).
+*/
+static inline void
+pg_buffercache_numa_prepare_ptrs(int i, float pages_per_blk, Size os_page_size,
+ void **os_page_ptrs, bool firstUseInBackend)
+{
+ int j = 0,
+ blk2page = (int) i * pages_per_blk;
+
+ do
+ {
+ if (os_page_ptrs[blk2page + j] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer IDs start at 1, hence i + 1 */
+ os_page_ptrs[blk2page + j] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+
+ /* Touching the memory is needed only once per backend */
+ if (firstUseInBackend == true)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2page + j]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ j++;
+ } while (j < (int) pages_per_blk);
+}
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -75,14 +124,32 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
HeapTuple tuple;
+ bool query_numa = PG_GETARG_BOOL(0);
+ static bool firstUseInBackend = true;
if (SRF_IS_FIRSTCALL())
{
int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ int os_page_count = 0;
+ float pages_per_blk = 0;
funcctx = SRF_FIRSTCALL_INIT();
- /* Switch context when allocating stuff to be used in later calls */
+ if (query_numa)
+ {
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;
+ }
+ }
+
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
/* Create a user function context for cross-call persistence */
@@ -122,10 +189,14 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
fctx->tupdesc = BlessTupleDesc(tupledesc);
/* Allocate NBuffers worth of BufferCachePagesRec records. */
@@ -137,9 +208,41 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
funcctx->max_calls = NBuffers;
funcctx->user_fctx = fctx;
- /* Return to original context when allocating transient memory */
+ /*
+ * Return to original context when allocating transient memory
+ */
MemoryContextSwitchTo(oldcontext);
+ if (query_numa)
+ {
+ /*
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB, .. 32kB) that end up being
+ * allocated in OS memory pages of various sizes, so first we
+ * need to determine the OS memory page size before calling
+ * move_pages().
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffers-to-OS-page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstUseInBackend == true)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+ }
+
/*
* Scan through all the buffers, saving the relevant fields in the
* fctx->record structure.
@@ -171,14 +274,41 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
else
fctx->record[i].isdirty = false;
- /* Note if the buffer is valid, and has storage created */
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
fctx->record[i].isvalid = true;
else
fctx->record[i].isvalid = false;
+ if (unlikely(query_numa))
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
+
UnlockBufHdr(bufHdr, buf_state);
}
+
+
+ if (query_numa)
+ {
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Technically we can get errors too here and pass that to
+ * user. Also we could somehow report single DB block spanning
+ * more than one NUMA zone, but it should be rare.
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
+ }
+ }
+ else
+ pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+
}
funcctx = SRF_PERCALL_SETUP();
@@ -209,8 +339,12 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
nulls[5] = true;
nulls[6] = true;
nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
nulls[8] = true;
+ nulls[9] = true;
}
else
{
@@ -228,9 +362,14 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
nulls[6] = false;
values[7] = Int16GetDatum(fctx->record[i].usagecount);
nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
}
/* Build and return the tuple. */
@@ -240,7 +379,10 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
+ firstUseInBackend = false;
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
diff --git a/contrib/pg_buffercache/sql/pg_buffercache.sql b/contrib/pg_buffercache/sql/pg_buffercache.sql
index 944fbb1bea..7f2ce683e6 100644
--- a/contrib/pg_buffercache/sql/pg_buffercache.sql
+++ b/contrib/pg_buffercache/sql/pg_buffercache.sql
@@ -5,6 +5,14 @@ select count(*) = (select setting::bigint
where name = 'shared_buffers')
from pg_buffercache;
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+RESET client_min_messages;
+
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -16,9 +24,14 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0;
-- having to create a dedicated user, use the pg_database_owner pseudo-role.
SET ROLE pg_database_owner;
SELECT * FROM pg_buffercache;
-SELECT * FROM pg_buffercache_pages() AS p (wrong int);
+SELECT * FROM pg_buffercache_pages(false) AS p (wrong int);
+SELECT * FROM pg_buffercache_pages(true) AS p (wrong int);
SELECT * FROM pg_buffercache_summary();
SELECT * FROM pg_buffercache_usage_counts();
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT * FROM pg_buffercache_numa;
+RESET client_min_messages;
RESET role;
-- Check that pg_monitor is allowed to query view / function
@@ -26,3 +39,7 @@ SET ROLE pg_monitor;
SELECT count(*) > 0 FROM pg_buffercache;
SELECT buffers_used + buffers_unused > 0 FROM pg_buffercache_summary();
SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET client_min_messages;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ad25cbb39c..dd34c79f52 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86..5f7d4b83a6 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
--
2.39.5
Hi,
On Tue, Mar 04, 2025 at 11:48:31AM +0100, Jakub Wartak wrote:
> Hi!
>
> On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > I did some tests and it looks like it's giving correct results. I don't see -2
> > anymore and every backend reports correct distribution (with or without hp,
> > with "small" or "large" shared buffer).
>
> Cool! Attached is v7

Thanks for the new version!

> > === 2
> >
> > + if (query_numa)
> > + {
> > + blk2page = (int) i * pages_per_blk;
> > + j = 0;
> > + do
> > + {
> >
> > This check is done for every page. I wonder if it would not make sense
> > to create a brand new function for pg_buffercache_numa and just let the
> > current pg_buffercache_pages() as it is. That said it would be great to avoid
> > code duplication as much as possible though, maybe using a shared
> > populate_buffercache_entry() or such helper function?
>
> Well, I've made query_numa a parameter there simply to avoid that code
> duplication in the first place, look at those TupleDescInitEntry()...
Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
or such function: to put the "duplicated" code in it and then use this
shared function in pg_buffercache_pages() and in the new numa related one.
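To make the suggestion concrete, here is a minimal, self-contained sketch of that split. Everything in it is hypothetical: DemoBufferDesc, DemoRec and the demo_* functions are simplified stand-ins for illustration, not the real BufferDesc or BufferCachePagesRec, and populate_buffercache_entry() only borrows the name proposed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real PostgreSQL structures. */
typedef struct DemoBufferDesc
{
	uint32_t	relfilenode;
	bool		dirty;
	uint16_t	usagecount;
} DemoBufferDesc;

typedef struct DemoRec
{
	uint32_t	relfilenode;
	bool		isdirty;
	uint16_t	usagecount;
	int32_t		numa_zone_id;	/* -1 means "not queried" */
} DemoRec;

/* Shared helper: fills in the fields that both entry points need. */
void
populate_buffercache_entry(DemoRec *rec, const DemoBufferDesc *buf)
{
	rec->relfilenode = buf->relfilenode;
	rec->isdirty = buf->dirty;
	rec->usagecount = buf->usagecount;
	rec->numa_zone_id = -1;
}

/* Plain variant: no NUMA branch inside the per-buffer loop. */
void
demo_buffercache_pages(DemoRec *out, const DemoBufferDesc *bufs, int n)
{
	for (int i = 0; i < n; i++)
		populate_buffercache_entry(&out[i], &bufs[i]);
}

/* NUMA variant: same helper, plus the zone taken from a prepared array. */
void
demo_buffercache_numa_pages(DemoRec *out, const DemoBufferDesc *bufs, int n,
							const int *zone_of_buffer)
{
	for (int i = 0; i < n; i++)
	{
		populate_buffercache_entry(&out[i], &bufs[i]);
		out[i].numa_zone_id = zone_of_buffer[i];
	}
}
```

With this shape neither loop needs a per-buffer "if (query_numa)" test, and the common field handling lives in one place.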
> IMHO rarely anybody uses pg_buffercache, but we could add unlikely()
I think unlikely() should be used for optimization based on code path likelihood,
not based on how often users might use a feature.
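As an illustration of that distinction, the sketch below puts the hint on a branch that is rare within the code path itself (an error case inside a hot loop), independent of how often anyone calls the function. The macro name demo_unlikely and the function sum_checked are made up for this example; the macro is assumed to be similar in spirit to PostgreSQL's unlikely(), which wraps __builtin_expect on GCC-compatible compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro for this example only. */
#if defined(__GNUC__) || defined(__clang__)
#define demo_unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define demo_unlikely(x) ((x) != 0)
#endif

/*
 * Sum the non-negative values. A negative value is treated as an error
 * case that is expected to be rare on this code path, so that branch
 * gets the hint.
 */
long
sum_checked(const int *vals, size_t n, int *bad_seen)
{
	long		total = 0;

	*bad_seen = 0;
	for (size_t i = 0; i < n; i++)
	{
		if (demo_unlikely(vals[i] < 0))
		{
			(*bad_seen)++;
			continue;
		}
		total += vals[i];
	}
	return total;
}
```

The hint only influences code layout and branch prediction; the result is the same with or without it.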
> > === 5
> >
> > Could we also avoid some code duplication with pg_get_shmem_allocations()?
>
> Not sure I understand do you want to avoid code duplication
> pg_get_shmem_allocations() vs pg_get_shmem_numa_allocations() or
> pg_get_shmem_numa_allocations() vs pg_buffercache_pages(query_numa =
> true) ?
I meant to say avoid code duplication between pg_get_shmem_allocations() and
pg_get_shmem_numa_allocations(). It might be possible to create a shared
function for them too. That said, it looks like that the savings (if any), would
not be that much, so maybe just forget about it.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Mar 4, 2025 at 5:02 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> > Cool! Attached is v7
>
> Thanks for the new version!
... and another one: 7b ;)
> > > === 2
[..]
> > Well, I've made query_numa a parameter there simply to avoid that code
> > duplication in the first place, look at those TupleDescInitEntry()...
>
> Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> or such function: to put the "duplicated" code in it and then use this
> shared function in pg_buffercache_pages() and in the new numa related one.
OK, so I hastily attempted that in 7b; I had to do a larger refactor
there to avoid code duplication between those two. I don't know which
attempt is better though (7 vs 7b).
> > IMHO rarely anybody uses pg_buffercache, but we could add unlikely()
>
> I think unlikely() should be used for optimization based on code path likelihood,
> not based on how often users might use a feature.
In 7b I've removed the unlikely(). For a moment I was thinking that
you were concerned about making this loop over NBuffers as optimized
as possible, and that this was the reason for splitting the routines.
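For reference, the per-buffer work in that loop is essentially a mapping from a buffer index to the OS memory page(s) backing it. A minimal sketch of that arithmetic follows; note that it uses integer division rather than the float pages_per_blk from the patch, and the demo_* function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Total number of OS pages to query for nbuffers blocks. */
size_t
demo_os_page_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
	return (nbuffers * blcksz) / os_page_size;
}

/* Index of the first OS page backing buffer i (0-based). */
size_t
demo_blk2page(size_t i, size_t blcksz, size_t os_page_size)
{
	return (i * blcksz) / os_page_size;
}

/*
 * Number of OS pages visited per block: at least one, even when many
 * blocks share a single (huge) page.
 */
size_t
demo_pages_per_block(size_t blcksz, size_t os_page_size)
{
	return blcksz >= os_page_size ? blcksz / os_page_size : 1;
}
```

With BLCKSZ = 8192 and 4kB pages, buffer 3 starts at OS page 6 and spans two pages; with 2MB huge pages, buffers 0 to 255 all map to OS page 0 and buffer 256 is the first one on page 1.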
> > > === 5
[..]
> I meant to say avoid code duplication between pg_get_shmem_allocations() and
> pg_get_shmem_numa_allocations(). It might be possible to create a shared
> function for them too. That said, it looks like that the savings (if any), would
> not be that much, so maybe just forget about it.
Yeah, OK, so let's leave it at that.
-J.
Attachments:
v7b-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones.patch (application/octet-stream)
From 65dbe35271a3800037858058dcec77b186e0df05 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v7b 3/3] Add pg_shmem_numa_allocations to show NUMA zones for
shared memory allocations
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 125 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/privileges.out | 25 ++++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/sql/privileges.sql | 10 +-
6 files changed, 176 insertions(+), 4 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf..cc014a62dc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..7d83a14390 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -568,3 +569,127 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count;
+ static bool firstUseInBackend = true;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ return (Datum) 0;
+ }
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB, .. 32kB) that end up being allocated in OS
+ * memory pages of various sizes, so first we need to determine the OS
+ * memory page size before calling move_pages().
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ if (firstUseInBackend == true)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffers-to-OS-page mapping code here
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstUseInBackend == true)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ {
+ /* FIXME: should we release LWlock here ? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+ }
+
+ memset(zones, 0, sizeof(zones));
+ /* Count number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ if (s >= 0)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= pg_numa_get_max_node(); i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int64GetDatum(i);
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+
+ }
+
+ /*
+ * XXX: Unlike pg_get_shmem_allocations(), this NUMA version does not
+ * report the following regions:
+ * 1. shared memory allocated but not counted via the shmem index,
+ * 2. as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstUseInBackend = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index cd9422d0ba..62f051c194 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8463,6 +8463,14 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+# shared memory usage with NUMA info
+{ oid => '5101', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index a76256405f..02997690e1 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
@@ -3171,6 +3183,15 @@ SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
+RESET client_min_messages;
RESET ROLE;
-- clean up
DROP ROLE regress_readallstats;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b..b63c6e0f74 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_zone_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_zone_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index d195aaf137..e969cc3854 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,16 +1921,22 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+RESET client_min_messages;
RESET ROLE;
-- clean up
--
2.39.5
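
The per-entry counting loop in pg_get_shmem_numa_allocations() can be illustrated outside the backend. The following is only a minimal standalone sketch, not the patch's code: count_pages_per_node() and MAX_NUMA_NODES are hypothetical names, and the status array is assumed to follow the move_pages(2) convention (a non-negative value is a node id, a negative value is -errno, e.g. -2 for a page that has not been faulted in yet). It also shows one possible way to guard the fixed-size counter array against out-of-range node ids.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NUMA_NODES 32		/* hypothetical cap, mirrors MAX_NUMA_ZONES */

/*
 * Tally how many OS pages landed on each NUMA node.
 *
 * status[i] is assumed to follow the move_pages(2) convention: a value
 * >= 0 is a node id, a negative value is -errno.  Entries that cannot be
 * attributed to a node (errors, or ids beyond the array) are skipped.
 *
 * Returns the number of skipped pages.
 */
static size_t
count_pages_per_node(const int *status, size_t npages,
					 size_t counts[MAX_NUMA_NODES])
{
	size_t		skipped = 0;

	assert(status != NULL && counts != NULL);

	memset(counts, 0, sizeof(size_t) * MAX_NUMA_NODES);
	for (size_t i = 0; i < npages; i++)
	{
		int			s = status[i];

		if (s >= 0 && s < MAX_NUMA_NODES)
			counts[s]++;
		else
			skipped++;
	}
	return skipped;
}
```

Multiplying counts[node] by the OS page size gives the per-node byte total that the view reports as numa_size. Returning the number of skipped pages is a design choice of this sketch; the patch itself silently ignores negative statuses.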
Attachment: v7b-0002-Extend-pg_buffercache-with-new-view-pg_buffercac.patch
From 84dcc240c51c568a192da8d11a49144b027a87de Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v7b 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA zone for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache.out | 26 +
contrib/pg_buffercache/meson.build | 1 +
.../pg_buffercache--1.5--1.6.sql | 42 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 470 +++++++++++++-----
contrib/pg_buffercache/sql/pg_buffercache.sql | 16 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/storage/pg_shmem.h | 1 +
9 files changed, 430 insertions(+), 133 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e..2a33602537 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out
index b745dc69ea..4da569c20a 100644
--- a/contrib/pg_buffercache/expected/pg_buffercache.out
+++ b/contrib/pg_buffercache/expected/pg_buffercache.out
@@ -8,6 +8,18 @@ from pg_buffercache;
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET client_min_messages;
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -34,6 +46,11 @@ SELECT * FROM pg_buffercache_summary();
ERROR: permission denied for function pg_buffercache_summary
SELECT * FROM pg_buffercache_usage_counts();
ERROR: permission denied for function pg_buffercache_usage_counts
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT * FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET client_min_messages;
RESET role;
-- Check that pg_monitor is allowed to query view / function
SET ROLE pg_monitor;
@@ -55,3 +72,12 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET client_min_messages;
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe4871..9b2e939341 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 0000000000..52f63aa258
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77d..b030ba3a6f 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e1..96b14f9ed4 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,14 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +45,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -61,84 +64,250 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine to map Buffers into addresses that can be
+ * later consumed by pg_numa_query_pages()
+ *
+ * Many buffers can point to the same page (in case of
+ * BLCKSZ < 4kB), but we want to query just the first
+ * address.
+ *
+ * In order to get reliable results we also need to touch
+ * memory pages, so that inquiry about NUMA zone doesn't
+ * return -2.
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int i, float pages_per_blk, Size os_page_size,
+ void **os_page_ptrs, bool firstUseInBackend)
+{
+ int j = 0,
+ blk2page = (int) i * pages_per_blk;
+
+ do
+ {
+ if (os_page_ptrs[blk2page + j] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer IDs are 1-based, hence i + 1 */
+ os_page_ptrs[blk2page + j] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+
+ /* We just need to do it only once in backend */
+ if (firstUseInBackend == true)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2page + j]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ j++;
+ } while (j < (int) pages_per_blk);
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages.
+ *
+ * We need fcinfo here and we pass it here with PG_FUNCTION_ARGS
+ */
+static BufferCachePagesContext *
+init_buffercache_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /*
+ * Return to original context when allocating transient memory
+ */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static void
+populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(i);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[i].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[i].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[i].blocknum = bufHdr->tag.blockNum;
+ fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ fctx->record[i].isdirty = true;
+ else
+ fctx->record[i].isdirty = false;
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[i].isvalid = true;
+ else
+ fctx->record[i].isvalid = false;
+
+ fctx->record[i].numa_zone_id = -1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static Datum
+get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[i].blocknum == InvalidBlockNumber ||
+ fctx->record[i].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[i].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[i].usagecount);
+ nulls[7] = false;
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
+ values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
+ nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
+ }
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
+
+/*
+ * When updating this routine, please keep it in sync with the one below:
+ * pg_buffercache_numa_pages()
+ */
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = init_buffercache_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -149,36 +318,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ populate_buffercache_entry(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -188,59 +328,129 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+ bool query_numa = true;
+ static bool firstUseInBackend = true;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ int os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;
+ }
+ fctx = init_buffercache_entries(funcctx, fcinfo);
+
+ if (query_numa)
+ {
+ /*
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+ * allocated in various different OS memory pages sizes, so first
+ * we need to understand the OS memory page size before calling
+ * move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably
+ * have a bug in our buffers to OS page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstUseInBackend == true)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+ }
/*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
*/
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
+ for (i = 0; i < NBuffers; i++)
{
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
+ populate_buffercache_entry(i, fctx);
+ if (query_numa)
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
}
- else
+
+ if (query_numa)
{
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Technically we can get errors here too and could pass them to
+ * the user. We could also report a single DB block spanning
+ * more than one NUMA zone, but that should be rare.
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
+ }
}
+ }
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ funcctx = SRF_PERCALL_SETUP();
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
+ firstUseInBackend = false;
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
diff --git a/contrib/pg_buffercache/sql/pg_buffercache.sql b/contrib/pg_buffercache/sql/pg_buffercache.sql
index 944fbb1bea..d982048425 100644
--- a/contrib/pg_buffercache/sql/pg_buffercache.sql
+++ b/contrib/pg_buffercache/sql/pg_buffercache.sql
@@ -5,6 +5,14 @@ select count(*) = (select setting::bigint
where name = 'shared_buffers')
from pg_buffercache;
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+RESET client_min_messages;
+
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -19,6 +27,10 @@ SELECT * FROM pg_buffercache;
SELECT * FROM pg_buffercache_pages() AS p (wrong int);
SELECT * FROM pg_buffercache_summary();
SELECT * FROM pg_buffercache_usage_counts();
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT * FROM pg_buffercache_numa;
+RESET client_min_messages;
RESET role;
-- Check that pg_monitor is allowed to query view / function
@@ -26,3 +38,7 @@ SET ROLE pg_monitor;
SELECT count(*) > 0 FROM pg_buffercache;
SELECT buffers_used + buffers_unused > 0 FROM pg_buffercache_summary();
SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET client_min_messages;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ad25cbb39c..dd34c79f52 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86..5f7d4b83a6 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
--
2.39.5
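
The DB block to OS page mapping used by pg_buffercache_numa_pages() relies on a float pages_per_blk ratio. As a rough illustration only, the same mapping can be written with integer arithmetic. This is a sketch under stated assumptions and not what the patch does: buffer_first_os_page() and buffer_os_page_span() are hypothetical helpers, buffer indexes are 0-based, and the buffer pool is assumed to start on an OS page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Index of the OS page that holds the first byte of buffer buf_idx
 * (0-based), assuming the buffer pool starts on an OS page boundary.
 */
static size_t
buffer_first_os_page(size_t buf_idx, size_t blcksz, size_t os_page_size)
{
	assert(blcksz > 0 && os_page_size > 0);
	return (size_t) (((uint64_t) buf_idx * blcksz) / os_page_size);
}

/*
 * Number of OS pages that one buffer overlaps.  This is 1 whenever the
 * buffer fits inside a single (possibly huge) page, and larger when
 * BLCKSZ exceeds the OS page size.
 */
static size_t
buffer_os_page_span(size_t buf_idx, size_t blcksz, size_t os_page_size)
{
	uint64_t	first;
	uint64_t	last;

	assert(blcksz > 0 && os_page_size > 0);
	first = (uint64_t) buf_idx * blcksz;
	last = first + blcksz - 1;
	return (size_t) (last / os_page_size - first / os_page_size + 1);
}
```

With BLCKSZ=8192 and 4kB pages, each buffer spans two OS pages and buffer 3 starts at OS page 6. With 2MB huge pages, buffers 0..255 share OS page 0 and buffer 256 starts OS page 1. Integer division avoids any rounding concerns of the float ratio at large NBuffers, at the cost of a 64-bit multiply per buffer.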
Attachment: v7b-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch
From 9795cf53205c6fb1be9ae5364664d6ba7b2baa65 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v7b 1/3] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper.
Other platforms can be added later.
libnuma is unavailable on 32-bit builds: there is no i386 shared object, so
we disable it there (it does not make sense anyway, as i386 is very
memory-limited even with PAE).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 7 +-
configure | 87 +++++++++++++++++++++
configure.ac | 13 ++++
meson.build | 17 +++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/Makefile | 3 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 43 +++++++++++
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 150 +++++++++++++++++++++++++++++++++++++
13 files changed, 331 insertions(+), 1 deletion(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2..584e3e5a44 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -428,6 +428,8 @@ task:
DEBIAN_FRONTEND=noninteractive apt-get -y install \
libcurl4-openssl-dev \
libcurl4-openssl-dev:i386 \
+ libnuma1 \
+ libnuma-dev
matrix:
- name: Linux - Debian Bookworm - Autoconf
@@ -448,6 +450,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
@@ -492,6 +495,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -804,7 +808,8 @@ task:
setup_additional_packages_script: |
apt-get update
- DEBIAN_FRONTEND=noninteractive apt-get -y install libcurl4-openssl-dev
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libcurl4-openssl-dev \
+ libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 93fddd6998..23c33dd997 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc..1a394dfc07 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/meson.build b/meson.build
index 13c13748e5..f81092eb66 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,21 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+libnuma = dependency('numa', required: libnumaopt)
+if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+endif
+if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3183,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3823,6 +3839,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c451714..adaadb5faf 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5a..0bd4b2d7d3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5a..bff9f077a8 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -54,6 +54,9 @@ ifeq ($(with_systemd),yes)
LIBS += -lsystemd
endif
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
+
override LDFLAGS := $(LDFLAGS) $(LDFLAGS_EX) $(LDFLAGS_EX_BE)
##########################################################################
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d..8894f80060 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 0000000000..d3ebe8b5bd
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d5023..f786c19160 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..a68a29d541 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d4..7ffbd4d88d 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 0000000000..db28578bca
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/* FIXME not tested, might crash */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
--
2.39.5
Hi,
On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Hi,
Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
or such function: to put the "duplicated" code in it and then use this
shared function in pg_buffercache_pages() and in the new NUMA-related one.
OK, so I hastily attempted that in v7b; I had to do a larger refactor
there to avoid code duplication between those two. I don't know which
attempt is better though (v7 vs v7b).
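To make the shape of that refactor concrete, here is a minimal standalone sketch of the "shared helper" idea. It is not the patch code: DemoRec, demo_populate_entry(), demo_pages() and demo_numa_pages() are hypothetical stand-ins for BufferCachePagesRec, populate_buffercache_entry(), pg_buffercache_pages() and pg_buffercache_numa_pages(), and the `status` array only imitates the per-buffer node ids that a NUMA query would return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for BufferCachePagesRec; not the real struct. */
typedef struct
{
	int			bufferid;
	bool		isdirty;
	int			numa_zone_id;	/* -1 means "not queried" */
} DemoRec;

/* Shared helper: fills the fields that both callers need. */
static void
demo_populate_entry(int i, DemoRec *records)
{
	records[i].bufferid = i + 1;	/* buffer ids are 1-based */
	records[i].isdirty = false;
	records[i].numa_zone_id = -1;	/* default until a NUMA query fills it */
}

/* Plain variant: only the shared fields are filled. */
static void
demo_pages(DemoRec *records, int nbuffers)
{
	for (int i = 0; i < nbuffers; i++)
		demo_populate_entry(i, records);
}

/*
 * NUMA variant: the same shared fields, plus a node id per buffer taken
 * from `status` (an assumed, already-populated result array).
 */
static void
demo_numa_pages(DemoRec *records, int nbuffers, const int *status)
{
	assert(status != NULL);

	for (int i = 0; i < nbuffers; i++)
	{
		demo_populate_entry(i, records);
		records[i].numa_zone_id = status[i];
	}
}
```

Both entry points go through demo_populate_entry(), so a new common field only has to be added in one place; the NUMA variant differs only in the extra assignment after the shared call.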
I'm attaching basically the earlier stuff (v7b) as v8 with the
following minor changes:
- docs are included
- changed int8 to int4 in one function definition for numa_zone_id
-J.
Attachments:
v8-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch
From 53b6f8925c7367ab668b39338507942388dbbeca Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v8 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA zone for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache.out | 26 +
contrib/pg_buffercache/meson.build | 1 +
.../pg_buffercache--1.5--1.6.sql | 42 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 470 +++++++++++++-----
contrib/pg_buffercache/sql/pg_buffercache.sql | 16 +
doc/src/sgml/pgbuffercache.sgml | 64 ++-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/storage/pg_shmem.h | 1 +
10 files changed, 493 insertions(+), 134 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e..2a33602537 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out
index b745dc69ea..4da569c20a 100644
--- a/contrib/pg_buffercache/expected/pg_buffercache.out
+++ b/contrib/pg_buffercache/expected/pg_buffercache.out
@@ -8,6 +8,18 @@ from pg_buffercache;
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET client_min_messages;
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -34,6 +46,11 @@ SELECT * FROM pg_buffercache_summary();
ERROR: permission denied for function pg_buffercache_summary
SELECT * FROM pg_buffercache_usage_counts();
ERROR: permission denied for function pg_buffercache_usage_counts
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT * FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET client_min_messages;
RESET role;
-- Check that pg_monitor is allowed to query view / function
SET ROLE pg_monitor;
@@ -55,3 +72,12 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET client_min_messages;
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe4871..9b2e939341 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 0000000000..52f63aa258
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77d..b030ba3a6f 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e1..96b14f9ed4 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,14 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +45,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -61,84 +64,250 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine to map Buffers into addresses that can be
+ * later consumed by pg_numa_query_pages()
+ *
+ * Many buffers can point to the same page (in case of
+ * BLCKSZ < 4kB), but we want to query just the first
+ * address.
+ *
+ * In order to get reliable results we also need to touch
+ * memory pages, so that inquiry about NUMA zone doesn't
+ * return -2.
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int i, float pages_per_blk, Size os_page_size,
+ void **os_page_ptrs, bool firstUseInBackend)
+{
+ int j = 0,
+ blk2page = (int) i * pages_per_blk;
+
+ do
+ {
+ if (os_page_ptrs[blk2page + j] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* NBuffers count start really from 1 */
+ os_page_ptrs[blk2page + j] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+
+ /* We just need to do it only once in backend */
+ if (firstUseInBackend == true)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2page + j]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ j++;
+ } while (j < (int) pages_per_blk);
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages.
+ *
+ * We need fcinfo here and we pass it here with PG_FUNCTION_ARGS
+ */
+static BufferCachePagesContext *
+init_buffercache_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /*
+ * Return to original context when allocating transient memory
+ */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static void
+populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(i);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[i].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[i].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[i].blocknum = bufHdr->tag.blockNum;
+ fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ fctx->record[i].isdirty = true;
+ else
+ fctx->record[i].isdirty = false;
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[i].isvalid = true;
+ else
+ fctx->record[i].isvalid = false;
+
+ fctx->record[i].numa_zone_id = -1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static Datum
+get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[i].blocknum == InvalidBlockNumber ||
+ fctx->record[i].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[i].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[i].usagecount);
+ nulls[7] = false;
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
+ values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
+ nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
+ }
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
+
+/*
+ * When updating this routine please sync it with below one:
+ * pg_buffercache_numa_pages()
+ */
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = init_buffercache_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -149,36 +318,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ populate_buffercache_entry(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -188,59 +328,129 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
+/*
+ * This is almost identical to the above, but also performs a
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+ bool query_numa = true;
+ static bool firstUseInBackend = true;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ int os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;
+ }
+ fctx = init_buffercache_entries(funcctx, fcinfo);
+
+ if (query_numa)
+ {
+ /*
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB, ..., 32kB) that end up being
+ * allocated in OS memory pages of various sizes, so first we need
+ * to determine the OS memory page size before calling
+ * move_pages().
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffer to OS page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstUseInBackend == true)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+ }
/*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
*/
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
+ for (i = 0; i < NBuffers; i++)
{
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
+ populate_buffercache_entry(i, fctx);
+ if (query_numa)
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
}
- else
+
+ if (query_numa)
{
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Technically we can get errors here too and could pass them on to
+ * the user. We could also somehow report a single DB block spanning
+ * more than one NUMA zone, but that should be rare.
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
+ }
}
+ }
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ funcctx = SRF_PERCALL_SETUP();
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
+ firstUseInBackend = false;
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
diff --git a/contrib/pg_buffercache/sql/pg_buffercache.sql b/contrib/pg_buffercache/sql/pg_buffercache.sql
index 944fbb1bea..d982048425 100644
--- a/contrib/pg_buffercache/sql/pg_buffercache.sql
+++ b/contrib/pg_buffercache/sql/pg_buffercache.sql
@@ -5,6 +5,14 @@ select count(*) = (select setting::bigint
where name = 'shared_buffers')
from pg_buffercache;
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+RESET client_min_messages;
+
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -19,6 +27,10 @@ SELECT * FROM pg_buffercache;
SELECT * FROM pg_buffercache_pages() AS p (wrong int);
SELECT * FROM pg_buffercache_summary();
SELECT * FROM pg_buffercache_usage_counts();
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT * FROM pg_buffercache_numa;
+RESET client_min_messages;
RESET role;
-- Check that pg_monitor is allowed to query view / function
@@ -26,3 +38,7 @@ SET ROLE pg_monitor;
SELECT count(*) > 0 FROM pg_buffercache;
SELECT buffers_used + buffers_unused > 0 FROM pg_buffercache_summary();
SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET client_min_messages;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d7..75978a6eae 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,13 @@
convenient use.
</para>
+ <para>
+ The similar <function>pg_buffercache_numa_pages()</function> function is a slower
+ variant of the above, but it can also provide the NUMA node ID for each shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +209,59 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are almost identical to the previous
+ <structname>pg_buffercache</structname> view, but this one includes one additional
+ column numa_zone_id as defined in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID (number) of the NUMA node for this particular buffer. NULL if the shared buffer
+ has not been used yet. On systems without NUMA this is usually 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ This is a clone of the original <structname>pg_buffercache</structname> view that provides
+ an additional <structfield>numa_zone_id</structfield> column. Fetching this
+ information from the OS is costly and can take much longer, so querying this view
+ from automated or monitoring systems is not recommended.
+ </para>
+
+ <para>
+ As the NUMA node ID inquiry requires each memory page to be paged in, the first
+ execution of this function can take a long time, especially on systems with large
+ <varname>shared_buffers</varname> and without <varname>huge_pages</varname> enabled.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ad25cbb39c..dd34c79f52 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86..5f7d4b83a6 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
--
2.39.5
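
For readers of the patch above, the DB block to OS page mapping in pg_buffercache_numa_pages() is easier to check in isolation. Below is a minimal standalone sketch, not the patch's code: it uses integer arithmetic instead of the float pages_per_blk, it assumes the buffer pool starts on an OS page boundary, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of OS memory pages backing nbuffers blocks of blcksz bytes.
 * Assumption: the total size is a multiple of os_page_size.
 */
static size_t
sketch_os_page_count(uint64_t nbuffers, uint64_t blcksz, uint64_t os_page_size)
{
	return (size_t) ((nbuffers * blcksz) / os_page_size);
}

/*
 * Index of the first OS memory page that holds buffer number buf.
 * Assumption: buffer 0 starts exactly at the beginning of OS page 0.
 * With blcksz > os_page_size a block spans several pages and only the
 * first one is reported; with blcksz < os_page_size (huge pages) many
 * blocks map to the same page.
 */
static size_t
sketch_first_os_page(uint64_t buf, uint64_t blcksz, uint64_t os_page_size)
{
	return (size_t) ((buf * blcksz) / os_page_size);
}
```

With 16384 buffers of 8kB and 4kB OS pages this gives 32768 pages, which matches the os_page_count printed in the first mail of this thread. With 2MB huge pages, buffers 0..255 all map to page 0 and buffer 256 is the first one on page 1.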
v8-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones-.patch
From 0d0319a32a7ea5689d10fe8de7cf5848d547cadf Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v8 3/3] Add pg_shmem_numa_allocations to show NUMA zones for
shared memory allocations
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 78 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 125 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/privileges.out | 25 ++++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/sql/privileges.sql | 10 +-
7 files changed, 254 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247..6f8fea37de 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,79 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows the NUMA nodes
+ assigned to allocations made from the server's main shared memory segment.
+ This includes both memory allocated by <productname>PostgreSQL</productname>
+ itself and memory allocated by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of the NUMA node.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA node, in bytes.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf..cc014a62dc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..7d83a14390 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -568,3 +569,127 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count;
+ static bool firstUseInBackend = true;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ return (Datum) 0;
+ }
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB, ..., 32kB) that end up being allocated in
+ * OS memory pages of various sizes, so first we need to determine
+ * the OS memory page size before calling move_pages().
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ if (firstUseInBackend == true)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffer to OS page mapping code here
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstUseInBackend == true)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ {
+ /* FIXME: should we release LWlock here ? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+ }
+
+ memset(zones, 0, sizeof(zones));
+ /* Count number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ if (s >= 0)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= pg_numa_get_max_node(); i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+
+ }
+
+ /*
+ * XXX: Unlike pg_get_shmem_allocations(), the NUMA version does not
+ * report the following regions: 1. shared memory allocated but not
+ * counted via the shmem index, and 2. as-of-yet unused shared
+ * memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstUseInBackend = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index cd9422d0ba..59b40dc8f4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8463,6 +8463,14 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+# shared memory usage with NUMA info
+{ oid => '5101', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index a76256405f..02997690e1 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
@@ -3171,6 +3183,15 @@ SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
+RESET client_min_messages;
RESET ROLE;
-- clean up
DROP ROLE regress_readallstats;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b..b63c6e0f74 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_zone_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_zone_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index d195aaf137..e969cc3854 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,16 +1921,22 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+RESET client_min_messages;
RESET ROLE;
-- clean up
--
2.39.5
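
The per-entry accounting in pg_get_shmem_numa_allocations() above turns the status array filled by the page inquiry into a page count per NUMA node. Below is a minimal standalone sketch of that step, not the patch's code: it adds an upper bound check that the patch leaves as a FIXME around MAX_NUMA_ZONES, and the function name and SKETCH_MAX_NODES are hypothetical.

```c
#include <stddef.h>

#define SKETCH_MAX_NODES 32		/* hypothetical bound, mirrors MAX_NUMA_ZONES */

/*
 * Count how many pages landed on each NUMA node.
 *
 * status[i] is the node ID reported for page i, or a negative value when
 * the inquiry failed for that page (for example -2 for a page that has
 * not been faulted in yet).  per_node must have room for max_nodes
 * counters.  Returns the number of pages that could not be attributed to
 * a node in the range [0, max_nodes).
 */
static size_t
sketch_count_pages_per_node(const int *status, size_t npages,
							size_t *per_node, size_t max_nodes)
{
	size_t		skipped = 0;
	size_t		i;

	for (i = 0; i < max_nodes; i++)
		per_node[i] = 0;

	for (i = 0; i < npages; i++)
	{
		int			s = status[i];

		if (s >= 0 && (size_t) s < max_nodes)
			per_node[s]++;
		else
			skipped++;			/* error status or node ID out of range */
	}

	return skipped;
}
```

Multiplying each counter by the OS page size gives the per-node size in bytes, as the numa_size column does. Whether out-of-range node IDs should be skipped, as done here, or reported as an error is a design choice this sketch does not settle.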
v8-0001-Add-optional-dependency-to-libnuma-Linux-only-for.patch
From cc5272cb4acd864482fa0f5ca051e4e1d659c32e Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v8 1/3] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper.
Other platforms can be added later.
libnuma is unavailable on 32-bit builds: due to the lack of an i386 shared object,
we disable it there (it would not make sense anyway, as i386 is very memory-limited
even with PAE).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 7 +-
configure | 87 +++++++++++++++++++
configure.ac | 13 +++
doc/src/sgml/installation.sgml | 20 +++++
meson.build | 17 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/Makefile | 3 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 43 ++++++++++
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 150 +++++++++++++++++++++++++++++++++
14 files changed, 351 insertions(+), 1 deletion(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2..584e3e5a44 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -428,6 +428,8 @@ task:
DEBIAN_FRONTEND=noninteractive apt-get -y install \
libcurl4-openssl-dev \
libcurl4-openssl-dev:i386 \
+ libnuma1 \
+ libnuma-dev
matrix:
- name: Linux - Debian Bookworm - Autoconf
@@ -448,6 +450,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
@@ -492,6 +495,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -804,7 +808,8 @@ task:
setup_additional_packages_script: |
apt-get update
- DEBIAN_FRONTEND=noninteractive apt-get -y install libcurl4-openssl-dev
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libcurl4-openssl-dev \
+ libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 93fddd6998..23c33dd997 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc..1a394dfc07 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index e076cefa3b..79203e45a8 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on Linux.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-libxml">
<term><option>--with-libxml</option></term>
<listitem>
@@ -2611,6 +2621,16 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on Linux. The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 13c13748e5..f81092eb66 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,21 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+libnuma = dependency('numa', required: libnumaopt)
+if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+endif
+if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3183,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3823,6 +3839,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c451714..adaadb5faf 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5a..0bd4b2d7d3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5a..bff9f077a8 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -54,6 +54,9 @@ ifeq ($(with_systemd),yes)
LIBS += -lsystemd
endif
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
+
override LDFLAGS := $(LDFLAGS) $(LDFLAGS_EX) $(LDFLAGS_EX_BE)
##########################################################################
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d..8894f80060 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 0000000000..d3ebe8b5bd
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d5023..f786c19160 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..a68a29d541 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d4..7ffbd4d88d 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 0000000000..db28578bca
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/* FIXME not tested, might crash */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
--
2.39.5
On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Hi,
On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Hi,
Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
or such function: to put the "duplicated" code in it and then use this
shared function in pg_buffercache_pages() and in the new numa related one.
OK, so hastily attempted that in 7b, I had to do a larger refactor
there to avoid code duplication between those two. I don't know which
attempt is better though (7 vs 7b)..
I'm attaching basically the earlier stuff (v7b) as v8 with the
following minor changes:
- docs are included
- changed int8 to int4 in one function definition for numa_zone_id
.. and v9 attached because cfbot partially complained about
.cirrus.tasks.yml being adjusted recently (it seems master is hot
these days).
-J.
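For readers following the thread: the per-entry accounting that patch 3/3 does in pg_get_shmem_numa_allocations() can be sketched standalone, without libnuma. This is only a minimal illustration and not code from the patch. The function name count_bytes_per_node() and the MAX_NUMA_NODES cap are hypothetical, and the only thing assumed about the kernel interface is the documented move_pages(2) convention that each status entry is either a node number (>= 0) or a negative errno such as -ENOENT (-2) for a page that has not been faulted in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cap for this sketch; the patch uses MAX_NUMA_ZONES (32). */
#define MAX_NUMA_NODES 32

/*
 * Sum the bytes resident on each NUMA node, given a move_pages(2)-style
 * status array with one entry per OS page.
 *
 * Entries that are negative (an errno, e.g. -2 for a page that is not
 * present) or that name a node >= nnodes are skipped instead of being
 * used as an array index.  Returns the number of skipped pages.
 */
static size_t
count_bytes_per_node(const int *status, size_t npages, size_t os_page_size,
					 size_t *bytes_per_node, int nnodes)
{
	size_t		skipped = 0;

	assert(os_page_size > 0);
	assert(nnodes > 0 && nnodes <= MAX_NUMA_NODES);

	/* start every node at zero bytes */
	for (int n = 0; n < nnodes; n++)
		bytes_per_node[n] = 0;

	for (size_t i = 0; i < npages; i++)
	{
		int			s = status[i];

		if (s >= 0 && s < nnodes)
			bytes_per_node[s] += os_page_size;
		else
			skipped++;
	}

	return skipped;
}
```

A caller would pass the status array filled by the page query together with the OS page size in effect (regular or huge pages), then emit one row per node from bytes_per_node. Returning the skipped count makes it visible when some pages were never touched, rather than silently dropping them.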
Attachments:
v9-0001-Add-optional-dependency-to-libnuma-Linux-only-for.patch
From ade181e760e9cb36e1688fa9cfd67172fea01509 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v9 1/3] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper.
Other platforms can be added later.
libnuma is unavailable on 32-bit builds, so due to the lack of an i386 shared
object we disable it there (it does not make sense anyway, as i386 is very
memory-limited even with PAE).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 12 ++-
configure | 87 +++++++++++++++++++
configure.ac | 13 +++
doc/src/sgml/installation.sgml | 20 +++++
meson.build | 17 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/Makefile | 3 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 43 ++++++++++
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 150 +++++++++++++++++++++++++++++++++
14 files changed, 353 insertions(+), 4 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 5849cbb839a..7010dff7aef 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -445,8 +445,10 @@ task:
EOF
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install \
+ libnuma1 \
+ libnuma-dev
matrix:
# SPECIAL:
@@ -471,6 +473,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
@@ -519,6 +522,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -835,8 +839,8 @@ task:
folder: $CCACHE_DIR
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 93fddd69981..23c33dd9971 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index e076cefa3b9..79203e45a83 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on Linux.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-libxml">
<term><option>--with-libxml</option></term>
<listitem>
@@ -2611,6 +2621,16 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on Linux. The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 13c13748e5d..f81092eb661 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,21 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+libnuma = dependency('numa', required: libnumaopt)
+if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+endif
+if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3183,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3823,6 +3839,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5ac..0bd4b2d7d32 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5aa..bff9f077a8c 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -54,6 +54,9 @@ ifeq ($(with_systemd),yes)
LIBS += -lsystemd
endif
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
+
override LDFLAGS := $(LDFLAGS) $(LDFLAGS_EX) $(LDFLAGS_EX_BE)
##########################################################################
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..d3ebe8b5bd8
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..a68a29d5414 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7ffbd4d88d2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..db28578bcaf
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/* FIXME not tested, might crash */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
--
2.39.5
v9-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones-.patch
From 74fbe3c042533409ef4ae847826d918451d3bf68 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v9 3/3] Add pg_shmem_numa_allocations to show NUMA zones for
shared memory allocations
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 78 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 125 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/privileges.out | 25 ++++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/sql/privileges.sql | 10 +-
7 files changed, 254 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..6f8fea37de6 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,79 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows how
+ allocations made from the server's main shared memory segment are spread across NUMA nodes.
+ This includes both memory allocated by <productname>PostgreSQL</productname>
+ itself and memory allocated by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of NUMA node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf5..cc014a62dc2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..7d83a143900 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -568,3 +569,127 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count;
+ static bool firstUseInBackend = true;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ return (Datum) 0;
+ }
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB, ... 32kB) that end up being allocated in
+ * various different OS memory page sizes, so first we need to understand
+ * the OS memory page size before calling move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ if (firstUseInBackend == true)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably have
+ * a bug in our buffers-to-OS-page mapping code here
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that the NUMA zone inquiry doesn't return -2 (-ENOENT).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstUseInBackend == true)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ {
+ /* FIXME: should we release LWlock here ? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+ }
+
+ memset(zones, 0, sizeof(zones));
+ /* Count number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ if (s >= 0 && s < MAX_NUMA_ZONES)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= pg_numa_get_max_node() && i < MAX_NUMA_ZONES; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+
+ }
+
+ /*
+ * XXX: Unlike pg_get_shmem_allocations(), the NUMA version does not
+ * report the following regions:
+ * 1. shared memory allocated but not counted via the shmem index
+ * 2. as-of-yet unused shared memory
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstUseInBackend = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index cede992b6e2..ac9b8003fbc 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8482,6 +8482,14 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+# shared memory usage with NUMA info
+{ oid => '5101', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index a76256405fe..02997690e1b 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
@@ -3171,6 +3183,15 @@ SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
+RESET client_min_messages;
RESET ROLE;
-- clean up
DROP ROLE regress_readallstats;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b2..b63c6e0f744 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_zone_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_zone_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index d195aaf1377..e969cc38545 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,16 +1921,22 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations;
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+RESET client_min_messages;
RESET ROLE;
-- clean up
--
2.39.5
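
For reference, the per-node tally that pg_get_shmem_numa_allocations()
builds from the status array can be sketched in standalone C as below.
This is only an illustrative sketch, not the patch code:
tally_node_bytes() and MAX_NODES are hypothetical names, and it assumes
a move_pages()-style status array in which non-negative entries are
node IDs and negative entries are -errno values. It bounds-checks the
node ID before indexing and counts the pages it could not attribute.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 32			/* hypothetical cap, mirrors MAX_NUMA_ZONES */

/*
 * Tally per-node byte counts from a move_pages()-style status array.
 * Non-negative entries are node IDs; negative entries are -errno values
 * (for example -2 for a page that has not been faulted in yet).
 * Returns the number of pages that could not be attributed to a node,
 * either because of an error status or because the node ID is >= MAX_NODES.
 */
static size_t
tally_node_bytes(const int *status, size_t npages, size_t os_page_size,
				 size_t bytes_per_node[MAX_NODES])
{
	size_t		skipped = 0;

	assert(bytes_per_node != NULL);

	for (size_t n = 0; n < MAX_NODES; n++)
		bytes_per_node[n] = 0;

	for (size_t i = 0; i < npages; i++)
	{
		int			s = status[i];

		if (s >= 0 && s < MAX_NODES)
			bytes_per_node[s] += os_page_size;
		else
			skipped++;
	}

	return skipped;
}
```

Returning the skipped count is a design choice of this sketch: it makes
unfaulted pages and out-of-range node IDs visible to the caller instead
of dropping them silently.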
Attachment: v9-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch (application/octet-stream)
From 3bc82e2aa255c4669ccf81bfee546a131cacedc3 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v9 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show the NUMA zone for individual buffers.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache.out | 26 +
contrib/pg_buffercache/meson.build | 1 +
.../pg_buffercache--1.5--1.6.sql | 42 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 470 +++++++++++++-----
contrib/pg_buffercache/sql/pg_buffercache.sql | 16 +
doc/src/sgml/pgbuffercache.sgml | 64 ++-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/storage/pg_shmem.h | 1 +
10 files changed, 493 insertions(+), 134 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
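
Note on the buffer to OS page mapping used below: the patch derives the
OS page index from a float pages_per_blk. A float has a 24-bit
significand, so buffer indexes above 2^24 cannot all be represented
exactly. A minimal integer-only sketch of the same mapping follows. It
is an illustration under stated assumptions, not the patch code:
first_os_page() and os_pages_spanned() are hypothetical helpers, and the
buffer pool is assumed to start on an OS page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Index of the first OS page backing buffer "buf" (0-based), relative to
 * the start of the buffer pool. Assumes the pool starts on an OS page
 * boundary.
 */
static size_t
first_os_page(size_t buf, size_t blcksz, size_t os_page_size)
{
	assert(blcksz > 0 && os_page_size > 0);
	return (size_t) (((uint64_t) buf * blcksz) / os_page_size);
}

/*
 * Number of OS pages that buffer "buf" touches: 1 when the block fits
 * inside a single (possibly huge) page, more when BLCKSZ exceeds the OS
 * page size.
 */
static size_t
os_pages_spanned(size_t buf, size_t blcksz, size_t os_page_size)
{
	uint64_t	start = (uint64_t) buf * blcksz;
	uint64_t	end = start + blcksz - 1;	/* last byte of the buffer */

	assert(blcksz > 0 && os_page_size > 0);
	return (size_t) (end / os_page_size - start / os_page_size) + 1;
}
```

With BLCKSZ=8kB and 4kB pages, buffer 3 starts at OS page 6 and spans 2
pages; with 2MB huge pages, buffers 0..255 all map to OS page 0.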
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out
index b745dc69eae..4da569c20ae 100644
--- a/contrib/pg_buffercache/expected/pg_buffercache.out
+++ b/contrib/pg_buffercache/expected/pg_buffercache.out
@@ -8,6 +8,18 @@ from pg_buffercache;
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET client_min_messages;
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -34,6 +46,11 @@ SELECT * FROM pg_buffercache_summary();
ERROR: permission denied for function pg_buffercache_summary
SELECT * FROM pg_buffercache_usage_counts();
ERROR: permission denied for function pg_buffercache_usage_counts
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT * FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET client_min_messages;
RESET role;
-- Check that pg_monitor is allowed to query view / function
SET ROLE pg_monitor;
@@ -55,3 +72,12 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
t
(1 row)
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET client_min_messages;
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..9b2e9393410 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..52f63aa258c
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..96b14f9ed49 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,14 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +45,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -61,84 +64,250 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine to map Buffers into addresses that can be
+ * later consumed by pg_numa_query_pages()
+ *
+ * Many buffers can point to the same OS page (in case of
+ * BLCKSZ < OS page size), but we want to query just the
+ * first address.
+ *
+ * In order to get reliable results we also need to touch
+ * memory pages, so that the inquiry about the NUMA zone
+ * doesn't return -2 (-ENOENT).
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int i, float pages_per_blk, Size os_page_size,
+ void **os_page_ptrs, bool firstUseInBackend)
+{
+ int j = 0,
+ blk2page = (int) (i * pages_per_blk);
+
+ do
+ {
+ if (os_page_ptrs[blk2page + j] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer IDs are 1-based, hence i + 1 */
+ os_page_ptrs[blk2page + j] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+
+ /* We need to do this only once per backend */
+ if (firstUseInBackend == true)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2page + j]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ j++;
+ } while (j < (int) pages_per_blk);
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages.
+ *
+ * We need fcinfo here, so it is passed in via PG_FUNCTION_ARGS.
+ */
+static BufferCachePagesContext *
+init_buffercache_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /*
+ * Return to original context when allocating transient memory
+ */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static void
+populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(i);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[i].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[i].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[i].blocknum = bufHdr->tag.blockNum;
+ fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ fctx->record[i].isdirty = true;
+ else
+ fctx->record[i].isdirty = false;
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[i].isvalid = true;
+ else
+ fctx->record[i].isvalid = false;
+
+ fctx->record[i].numa_zone_id = -1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static Datum
+get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[i].blocknum == InvalidBlockNumber ||
+ fctx->record[i].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[i].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[i].usagecount);
+ nulls[7] = false;
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
+ values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
+ nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
+ }
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
+
+/*
+ * When updating this routine, please keep it in sync with
+ * pg_buffercache_numa_pages() below.
+ */
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = init_buffercache_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -149,36 +318,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ populate_buffercache_entry(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -188,59 +328,129 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
+/*
+ * This is almost identical to the above, but also performs
+ * a NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+ bool query_numa = true;
+ static bool firstUseInBackend = true;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ int os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;
+ }
+ fctx = init_buffercache_entries(funcctx, fcinfo);
+
+ if (query_numa)
+ {
+ /*
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB, ..., 32kB) that end up being
+ * allocated in various OS memory page sizes, so first we need to
+ * determine the OS memory page size before calling
+ * move_pages().
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we
+ * probably have a bug in our buffers-to-OS-page mapping code here.
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstUseInBackend == true)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+ }
/*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
*/
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
+ for (i = 0; i < NBuffers; i++)
{
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
+ populate_buffercache_entry(i, fctx);
+ if (query_numa)
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
}
- else
+
+ if (query_numa)
{
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) (i * pages_per_blk);
+
+ /*
+ * Technically we can get errors here too and could pass them
+ * to the user. We could also report a single DB block spanning
+ * more than one NUMA zone, but that should be rare.
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
+ }
}
+ }
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ funcctx = SRF_PERCALL_SETUP();
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
+ firstUseInBackend = false;
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
diff --git a/contrib/pg_buffercache/sql/pg_buffercache.sql b/contrib/pg_buffercache/sql/pg_buffercache.sql
index 944fbb1beae..d982048425f 100644
--- a/contrib/pg_buffercache/sql/pg_buffercache.sql
+++ b/contrib/pg_buffercache/sql/pg_buffercache.sql
@@ -5,6 +5,14 @@ select count(*) = (select setting::bigint
where name = 'shared_buffers')
from pg_buffercache;
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+RESET client_min_messages;
+
select buffers_used + buffers_unused > 0,
buffers_dirty <= buffers_used,
buffers_pinned <= buffers_used
@@ -19,6 +27,10 @@ SELECT * FROM pg_buffercache;
SELECT * FROM pg_buffercache_pages() AS p (wrong int);
SELECT * FROM pg_buffercache_summary();
SELECT * FROM pg_buffercache_usage_counts();
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT * FROM pg_buffercache_numa;
+RESET client_min_messages;
RESET role;
-- Check that pg_monitor is allowed to query view / function
@@ -26,3 +38,7 @@ SET ROLE pg_monitor;
SELECT count(*) > 0 FROM pg_buffercache;
SELECT buffers_used + buffers_unused > 0 FROM pg_buffercache_summary();
SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
+-- to ignore potential NOTICE: libnuma initialization failed..
+SET client_min_messages TO warning ;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET client_min_messages;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..75978a6eaed 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,13 @@
convenient use.
</para>
+ <para>
+ The similar <function>pg_buffercache_numa_pages()</function> is a slower
+ variant of the above, but it can also provide the NUMA node ID for each shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +209,59 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are almost identical to the previous
+ <structname>pg_buffercache</structname> view, but this one includes one additional
+ column numa_zone_id as defined in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID (number) of the NUMA node for this particular buffer. NULL if the shared buffer
+ has not been used yet. On systems without NUMA this usually returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ This is a clone of the original pg_buffercache view, but it provides an
+ additional <structfield>numa_zone_id</structfield> column. Fetching this
+ information from the OS is costly and might take much longer, so querying it from
+ automated or monitoring systems is not recommended.
+ </para>
+
+ <para>
+ As the NUMA node ID inquiry for each page requires memory pages to be paged in, the first
+ execution of this function can take a long time, especially on systems with large
+ shared_buffers and without huge_pages enabled.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ad25cbb39c5..dd34c79f521 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
--
2.39.5
Hi,
On Fri, Mar 07, 2025 at 12:33:27PM +0100, Jakub Wartak wrote:
On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Hi,
On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Hi,
Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
or such function: to put the "duplicated" code in it and then use this
shared function in pg_buffercache_pages() and in the new numa related one.
OK, so hastily attempted that in 7b, I had to do a larger refactor
there to avoid code duplication between those two. I don't know which
attempt is better though (7 vs 7b)..
I'm attaching basically the earlier stuff (v7b) as v8 with the
following minor changes:
- docs are included
- changed int8 to int4 in one function definition for numa_zone_id
.. and v9 attached because cfbot partially complained about
.cirrus.tasks.yml being adjusted recently (it seems master is hot
these days).
Thanks for the new version!
Some random comments on 0001:
=== 1
It does not compile "alone". It's missing:
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
and
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
Those come with 0002, so they have to be in 0001 instead.
=== 2
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
Maybe we should add the same test (checking for numa.h) for meson?
=== 3
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
It looks to me that we can remove those 2 lines.
=== 4
+ Only supported on Linux.
s/on Linux/on platforms for which the libnuma library is implemented/?
I did a quick grep on "Only supported on" and it looks like that could be
a more consistent wording.
=== 5
+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>
I had a closer look to other header files and it looks like it "should" be:
#include "c.h"
#include "postgres.h"
#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif
#include "port/pg_numa.h"
#include "storage/pg_shmem.h"
And is "#include "c.h"" really needed?
=== 6
+/* FIXME not tested, might crash */
That's a bit scary.
=== 7
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here
s/here/here./
Did not look carefully at all the comments in 0001, 0002 and 0003 though.
A few random comments regarding 0002:
=== 8
# create extension pg_buffercache;
ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages()
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE"
extension script file "pg_buffercache--1.2.sql", near line 7
While that's ok if 0003 is applied. I think that each individual patch should
"fully" work individually.
=== 9
+ do
+ {
+ if (os_page_ptrs[blk2page + j] == 0)
blk2page + j will be repeated multiple times, better to store it in a local
variable instead?
=== 10
+ if (firstUseInBackend == true)
if (firstUseInBackend) instead?
=== 11
+ int j = 0,
+ blk2page = (int) i * pages_per_blk;
I wonder if size_t is more appropriate for blk2page:
size_t blk2page = (size_t)(i * pages_per_blk) maybe?
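To illustrate the mapping discussed in === 11, here is a minimal sketch. It assumes pages_per_blk is a floating-point ratio of BLCKSZ to the OS page size; the helper name blk_to_first_page() is hypothetical and not part of the patch:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: index of the first OS page backing a buffer.
 * pages_per_blk is BLCKSZ / os_page_size, so it is below 1.0 when one
 * huge page holds many buffers.
 */
static size_t
blk_to_first_page(int buffer_id, double pages_per_blk)
{
	assert(buffer_id >= 0 && pages_per_blk > 0.0);

	/* Multiply in floating point first, then truncate once to size_t. */
	return (size_t) (buffer_id * pages_per_blk);
}
```

With 8kB blocks on 4kB pages the ratio is 2.0, so buffer 3 starts at page 6; with 2MB huge pages the ratio is 8192/2097152, so buffers 0..255 all map to page 0.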
=== 12
as we know that we'll iterate until pages_per_blk then would a for loop be more
appropriate here, something like?
"
for (size_t j = 0; j < pages_per_blk; j++)
{
if (os_page_ptrs[blk2page + j] == 0)
{
"
=== 13
+ if (buf_state & BM_DIRTY)
+ fctx->record[i].isdirty = true;
+ else
+ fctx->record[i].isdirty = false;
fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?
=== 14
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[i].isvalid = true;
+ else
+ fctx->record[i].isvalid = false;
fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
maybe?
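As a small standalone illustration of the two suggestions in === 13 and === 14, the sketch below uses made-up flag bits (the DEMO_* values are assumptions for the example, not the real BM_* definitions) and hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits only; not the real PostgreSQL BM_* values. */
#define DEMO_BM_DIRTY		(1U << 0)
#define DEMO_BM_VALID		(1U << 1)
#define DEMO_BM_TAG_VALID	(1U << 2)

/* Direct assignment form of the if/else from === 13. */
static bool
demo_isdirty(uint32_t buf_state)
{
	return (buf_state & DEMO_BM_DIRTY) != 0;
}

/* Direct assignment form of the if/else from === 14. */
static bool
demo_isvalid(uint32_t buf_state)
{
	return (buf_state & DEMO_BM_VALID) && (buf_state & DEMO_BM_TAG_VALID);
}
```

Both forms produce the same values as the if/else variants; whether they read better is a matter of taste.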
=== 15
+populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
I wonder if "i" should get a more descriptive name?
=== 16
s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
with pg_stat_io_build_tuples for example).
I now realize that I did propose populate_buffercache_entry() up-thread,
sorry for changing my mind.
=== 17
+ static bool firstUseInBackend = true;
maybe we should give it a more descriptive name?
Also I need to think more about how firstUseInBackend is used, for example:
=== 17.1
would it be better to define it as a file scope variable? (at least that
would avoid to pass it as an extra parameter in some functions).
=== 17.2
what happens if firstUseInBackend is set to false and later on the pages
are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
called again by a backend that already had set firstUseInBackend to false,
would that provide accurate information?
=== 18
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+ bool query_numa = true;
I don't think we need query_numa anymore in pg_buffercache_numa_pages().
I think that we can just ERROR (or NOTICE and return) here:
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;
and fully get rid of query_numa.
=== 19
And then here:
for (i = 0; i < NBuffers; i++)
{
populate_buffercache_entry(i, fctx);
if (query_numa)
pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
}
if (query_numa)
{
if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
elog(ERROR, "failed NUMA pages inquiry: %m");
for (i = 0; i < NBuffers; i++)
{
int blk2page = (int) i * pages_per_blk;
/*
* Technically we can get errors too here and pass that to
* user. Also we could somehow report single DB block spanning
* more than one NUMA zone, but it should be rare.
*/
fctx->record[i].numa_zone_id = os_pages_status[blk2page];
}
maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?
A few random comments regarding 0003:
=== 20
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations</structname> view shows NUMA nodes
s/pg_shmem_allocations/pg_shmem_numa_allocations/?
=== 21
+ /* FIXME: should we release LWlock here ? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
There is no need, see src/backend/storage/lmgr/README.
=== 22
+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];
could we rely on pg_numa_get_max_node() instead?
=== 23
+ if (s >= 0)
+ zones[s]++;
should we also check that s is below a limit?
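A minimal sketch of what such a check could look like, assuming the array is sized for nodes 0..max_node and that negative status entries are per-page error codes. The function name count_pages_per_node() is hypothetical and not from the patch:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count pages per NUMA node from a status array.
 * counts[] must have room for max_node + 1 entries. Returns the number
 * of entries that were skipped because they were negative (errors) or
 * above max_node.
 */
static size_t
count_pages_per_node(const int *status, size_t npages, int max_node,
					 size_t *counts)
{
	size_t		skipped = 0;

	assert(max_node >= 0);

	for (int n = 0; n <= max_node; n++)
		counts[n] = 0;

	for (size_t i = 0; i < npages; i++)
	{
		int			s = status[i];

		/* Guard both ends so counts[] is never indexed out of range. */
		if (s >= 0 && s <= max_node)
			counts[s]++;
		else
			skipped++;
	}

	return skipped;
}
```

A node that never shows up in the status array simply keeps a count of zero.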
=== 24
Regarding how we make use of pg_numa_get_max_node(), are we sure there is
no possible holes? I mean could a system have node 0,1 and 3 but not 2?
Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
did a little in 0001 though)
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for the new version!
v10 is attached with most fixes after review and one new thing
introduced: pg_numa_available() for run-time decision inside tests
which was needed after simplifying code a little bit as you wanted.
I've also fixed -Dlibnuma=disabled as it was not properly implemented.
There are a couple of open items (minor/decision things), but most are fixed
or answered:
Some random comments on 0001:
=== 1
It does not compile "alone". It's missing:
[..]
+extern PGDLLIMPORT int huge_pages_status;
[..]
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
That come with 0002. So it has to be in 0001 instead.
Ugh, good catch, I haven't thought about it in isolation, they are
separate to just ease review, but should be committed together. Fixed.
=== 2
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
Maybe we should add the same test (checking for numa.h) for meson?
TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
between meson and autoconf?
=== 3
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
It looks to me that we can remove those 2 lines.
Done.
=== 4
+ Only supported on Linux.
s/on Linux/on platforms for which the libnuma library is implemented/?
I did a quick grep on "Only supported on" and it looks like that could be
a more consistent wording.
Fixed.
=== 5
+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>
I had a closer look to other header files and it looks like it "should" be:
#include "c.h"
#include "postgres.h"
#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif
#include "port/pg_numa.h"
#include "storage/pg_shmem.h"
And is "#include "c.h"" really needed?
Fixed both. It seems to compile without c.h.
=== 6
+/* FIXME not tested, might crash */
That's a bit scary.
When you are in support for long enough it becomes the norm ;) But
on a serious note, Andres wanted to have numa error/warning handlers (used
by libnuma), but the current code has no realistic code path to hit them
from numa_available(3), numa_move_pages(3) or numa_max_node(3). The
situation won't change until one day in the future (I hope!) we start
using some more advanced libnuma functionality for interleaving
memory, please see my earlier reply:
/messages/by-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA+s+_5hcTOg@mail.gmail.com
E.g. numa_available(3) is a tiny wrapper, see
https://github.com/numactl/numactl/blob/master/libnuma.c#L872
For now, I've adjusted that FIXME to XXX, but I still don't know how we
could inject a failure to see this triggered...
=== 7
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here
s/here/here./
Done.
Did not look carefully at all the comments in 0001, 0002 and 0003 though.
A few random comments regarding 0002:
=== 8
# create extension pg_buffercache;
ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages()
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE"
extension script file "pg_buffercache--1.2.sql", near line 7
While that's ok if 0003 is applied. I think that each individual patch should
"fully" work individually.
STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
0002 or 0001 + 0003, but here 0002 is complaining about lack of
pg_numa_query_pages() which is part of 0001 (?) Should I merge those
patches or keep them separate?
=== 9
+ do
+ {
+ if (os_page_ptrs[blk2page + j] == 0)
blk2page + j will be repeated multiple times, better to store it in a local
variable instead?
Done.
=== 10
+ if (firstUseInBackend == true)
if (firstUseInBackend) instead?
Done everywhere.
=== 11
+ int j = 0,
+ blk2page = (int) i * pages_per_blk;
I wonder if size_t is more appropriate for blk2page:
size_t blk2page = (size_t)(i * pages_per_blk) maybe?
Sure, done.
=== 12
as we know that we'll iterate until pages_per_blk then would a for loop be more
appropriate here, something like?
"
for (size_t j = 0; j < pages_per_blk; j++)
{
if (os_page_ptrs[blk2page + j] == 0)
{
"
Sure.
=== 13
+ if (buf_state & BM_DIRTY)
+ fctx->record[i].isdirty = true;
+ else
+ fctx->record[i].isdirty = false;
fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?
=== 14
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[i].isvalid = true;
+ else
+ fctx->record[i].isvalid = false;
fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
maybe?
It is coming from the original pg_buffercache and it is less readable
to me, so I don't want to touch that too much because that might open
refactoring doors too wide.
=== 15
+populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
I wonder if "i" should get a more descriptive name?
Done, buffer_id.
=== 16
s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
with pg_stat_io_build_tuples for example).
I now realize that I did propose populate_buffercache_entry() up-thread,
sorry for changing my mind.
OK, I've tried to create a better name, but all my ideas were too
long, but pg_buffercache_build_tuple sounds nice, so lets use that.
=== 17
+ static bool firstUseInBackend = true;
maybe we should give it a more descriptive name?
I couldn't come up with anything that wouldn't look too long, so
instead I've added a comment explaining the meaning behind this
variable, hope that's good enough.
Also I need to think more about how firstUseInBackend is used, for example:
=== 17.1
would it be better to define it as a file scope variable? (at least that
would avoid to pass it as an extra parameter in some functions).
I have no strong opinion on this, but I have one doubt (for future):
isn't creating global variables making life harder for upcoming
multithreading guys ?
=== 17.2
what happens if firstUseInBackend is set to false and later on the pages
are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
called again by a backend that already had set firstUseInBackend to false,
would that provide accurate information?
It is still the correct result. That "touching" (paging-in) is only
necessary probably to properly resolve PTEs as the fork() does not
seem to carry them over from parent:
postgres=# select 'create table foo' || s || ' as select
generate_series(1, 100000);' from generate_series(1, 4) s;
postgres=# \gexec
SELECT 100000
SELECT 100000
SELECT 100000
SELECT 100000
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
group by numa_zone_id order by numa_zone_id; -- before:
numa_zone_id | count
--------------+---------
0 | 256
1 | 4131
| 8384221
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
1451
-- now, from another root (!) session, use migratepages(8) from numactl,
which will also shift the shm segment:
# migratepages 1451 1 3
-- the above stays in progress for a long time, but the outcome
is visible much sooner in that backend (pg keeps functioning):
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
group by numa_zone_id order by numa_zone_id;
numa_zone_id | count
--------------+---------
0 | 256
3 | 4480
| 8383872
So the above clearly shows that initial touching of shm is required,
but just once and it stays valid afterwards.
BTW: migratepages(8) was stuck for 1-2 minutes there on
"__wait_rcu_gp" according to its wchan, without any sign of activity
on the OS, and then out of the blue completed just fine, s_b=64GB,HP=on.
=== 18
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+ bool query_numa = true;
I don't think we need query_numa anymore in pg_buffercache_numa_pages().
I think that we can just ERROR (or NOTICE and return) here:
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;
and fully get rid of query_numa.
Right... fixed it here, but it made the tests blow up, so I've had to
find a way to conditionally launch tests based on NUMA availability
and that's how pg_numa_available() was born. It's in ipc/shmem.c
because I couldn't find a better place for it...
=== 19
And then here:
for (i = 0; i < NBuffers; i++)
{
populate_buffercache_entry(i, fctx);
if (query_numa)
pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
}
if (query_numa)
{
if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
elog(ERROR, "failed NUMA pages inquiry: %m");
for (i = 0; i < NBuffers; i++)
{
int blk2page = (int) i * pages_per_blk;
/*
* Technically we can get errors too here and pass that to
* user. Also we could somehow report single DB block spanning
* more than one NUMA zone, but it should be rare.
*/
fctx->record[i].numa_zone_id = os_pages_status[blk2page];
}
maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?
Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper
that prepares **os_page_ptrs, which is then used by
pg_numa_query_pages() and then we fill the data. But now after
removing query_numa , it reads smoother anyway. Can you please take a
look again on this, is this up to the project standards?
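To make the three phases described above concrete (prepare pointers, one batched query, then fill per-buffer results), here is a self-contained sketch. It is not the patch code: fill_numa_ids(), page_query_fn and fake_query() are hypothetical names, the query is a stand-in for something like pg_numa_query_pages(), and it assumes the block size is a whole multiple of the OS page size:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature standing in for a batched page-to-node query. */
typedef int (*page_query_fn) (size_t count, void **pages, int *status);

/* Fake query for demonstration: every 4 consecutive pages share a node. */
static int
fake_query(size_t count, void **pages, int *status)
{
	(void) pages;
	for (size_t i = 0; i < count; i++)
		status[i] = (int) (i / 4);
	return 0;
}

static int
fill_numa_ids(char *base, size_t nbuffers, size_t blcksz, size_t os_page_size,
			  void **pages, int *status, int *numa_ids, page_query_fn query)
{
	size_t		pages_per_blk;

	assert(os_page_size > 0 && blcksz % os_page_size == 0);
	pages_per_blk = blcksz / os_page_size;

	/* Phase 1: collect the address of every OS page behind every buffer. */
	for (size_t i = 0; i < nbuffers; i++)
		for (size_t j = 0; j < pages_per_blk; j++)
			pages[i * pages_per_blk + j] = base + i * blcksz + j * os_page_size;

	/* Phase 2: a single batched query needs all addresses up front. */
	if (query(nbuffers * pages_per_blk, pages, status) != 0)
		return -1;

	/* Phase 3: report the node of the first page of each buffer. */
	for (size_t i = 0; i < nbuffers; i++)
		numa_ids[i] = status[i * pages_per_blk];

	return 0;
}
```

Because the query is issued once for the whole pointer array, the preparation and the fill cannot be merged into one pass over the buffers without going back to a per-buffer query.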
A few random comments regarding 0003:
=== 20
+ The <structname>pg_shmem_allocations</structname> view shows NUMA nodes
s/pg_shmem_allocations/pg_shmem_numa_allocations/?
Fixed.
=== 21
+ /* FIXME: should we release LWlock here ? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
There is no need, see src/backend/storage/lmgr/README.
Thanks, fixed.
=== 22
+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];
could we rely on pg_numa_get_max_node() instead?
Sure, done.
=== 23
+ if (s >= 0)
+ zones[s]++;
should we also check that s is below a limit?
STILL OPEN QUESTION: I'm not sure it would give us value to report
e.g. -2 on per shm entry/per numa node entry, would it? If it would we
could somehow overallocate that array and remember negative ones too.
=== 24
Regarding how we make use of pg_numa_get_max_node(), are we sure there is
no possible holes? I mean could a system have node 0,1 and 3 but not 2?
I have never seen a hole in the numbering, as it is established during
boot, and the only way to get close to changing it would be taking
processor books (CPU and RAM together) offline. Even *if*
someone were doing some preventive hardware maintenance, that
still wouldn't hurt, as we only use the ID of the zone to
display it -- the max would already be higher. I mean, technically one
could use lsmem(1) (it shows a removable flag), let the pages and
processes migrate away, then use chcpu(1) --disable on the CPUs of
that zone (to prevent new processes from landing there and
allocating new local memory there) and then offline that memory
region via chmem(1) --disable. Even with all of that, it still
shouldn't cause issues for this code I think, because
`numa_max_node()` says it `returns the >> highest node number <<
available on the current system.`
Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
did a little in 0001 though)
That was an attempt to say "Thank You", OK, I've aligned it that way,
so you are still referenced in 0001. I wouldn't find motivation to
work on this if you won't respond to those emails ;)
There is one known issue that CI reported for the numa.out test, which I
need to get to the bottom of (it does not reproduce for me locally), inside
numa.out/sql:
SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations
WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
-J.
Attachments:
v10-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch (application/octet-stream)
From f77b7e1a680a7aad370cf2ac9233b9c17bd5a4a2 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v10 1/3] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper.
Other platforms can be added later.
libnuma is unavailable on 32-bit builds, so due to lack of i386 shared object,
we disable it there (it does not make sense anyway as i386 is very
memory-limited even with PAE)
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 12 ++-
configure | 87 ++++++++++++++++
configure.ac | 13 +++
doc/src/sgml/func.sgml | 13 +++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 21 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/storage/ipc/shmem.c | 11 ++
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 5 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 43 ++++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 151 ++++++++++++++++++++++++++++
18 files changed, 387 insertions(+), 5 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 5849cbb839a..7010dff7aef 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -445,8 +445,10 @@ task:
EOF
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install \
+ libnuma1 \
+ libnuma-dev
matrix:
# SPECIAL:
@@ -471,6 +473,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
@@ -519,6 +522,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -835,8 +839,8 @@ task:
folder: $CCACHE_DIR
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 93fddd69981..23c33dd9971 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 51dd8ad6571..071b98e6c9a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25061,6 +25061,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if <acronym>NUMA</acronym> support is available.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index e076cefa3b9..9f56205a1d7 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-libxml">
<term><option>--with-libxml</option></term>
<listitem>
@@ -2611,6 +2621,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 13c13748e5d..19500ebdfb2 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,25 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ else
+ libnuma = not_found_dep
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3187,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3823,6 +3843,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5ac..0bd4b2d7d32 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..4c9c3cb320f 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -568,3 +569,13 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL-level function returning whether NUMA support is available. */
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ if (pg_numa_init() == -1)
+ PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(true);
+}
+
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ad25cbb39c5..dd34c79f521 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 42e427f8fe8..38612d8ae12 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8489,6 +8489,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '5102', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
@@ -12477,3 +12481,4 @@
prosrc => 'gist_stratnum_common' },
]
+
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..d3ebe8b5bd8
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) (ptr)
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..a68a29d5414 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7ffbd4d88d2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..b9348caaca9
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,151 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * Currently only Linux is supported, via libnuma. Support for other
+ * platforms, e.g. Win32 or FreeBSD, might be added in the future. For the
+ * Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/* XXX: not tested */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here.
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
--
2.39.5
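
Note on the status array filled by pg_numa_query_pages() above: with a NULL
nodes argument, move_pages(2) only reports where each page lives, storing a
node number (>= 0) or a negative errno such as -ENOENT (page not present) in
each status slot. The following is a minimal sketch of grouping those values
per node in C, along the lines of the "number-of-pages-per-numa-node" idea
quoted earlier in the thread. The helper name sketch_count_pages_per_node()
is hypothetical and not part of this patch; it assumes the caller supplies a
per_node array with max_node + 1 entries.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: tally move_pages(2)-style
 * status values per NUMA node.
 *
 * status[i] >= 0 is treated as a node number; anything else (negative
 * errno values such as -ENOENT, or a node above max_node) is counted as
 * unresolved.  Returns the number of unresolved pages.
 */
static size_t
sketch_count_pages_per_node(const int *status, size_t count,
                            size_t *per_node, int max_node)
{
    size_t unresolved = 0;

    assert(max_node >= 0);

    /* Start every node's counter at zero. */
    for (int n = 0; n <= max_node; n++)
        per_node[n] = 0;

    for (size_t i = 0; i < count; i++)
    {
        if (status[i] >= 0 && status[i] <= max_node)
            per_node[status[i]]++;
        else
            unresolved++;       /* e.g. -ENOENT: page not faulted in yet */
    }

    return unresolved;
}
```

A caller would size per_node from pg_numa_get_max_node() + 1 and pass the
status array returned by the query; whether unresolved pages should be shown
to the user or raised as an error is left open here.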
v10-0002-Extend-pg_buffercache-with-new-view-pg_buffercac.patch
From df7f25e69e7e39d564e78783dab4d4fa4287dc4d Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v10 2/3] Extend pg_buffercache with new view
 pg_buffercache_numa to show NUMA zone for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
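
Note on the buffer to OS page mapping used by pg_buffercache_numa_pages()
in this patch: the patch keeps the ratio BLCKSZ / os_page_size as a float
(pages_per_blk) and multiplies buffer indexes by it. The sketch below shows
the same mapping with plain integer arithmetic instead, which is a swapped-in
technique for illustration only and not what the patch does. All sketch_*
names are hypothetical. It assumes 0-based buffer indexes and a buffer pool
that starts on an OS page boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: number of OS pages backing the whole buffer pool. */
static size_t
sketch_os_page_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
    assert(os_page_size > 0);
    return (nbuffers * blcksz) / os_page_size;
}

/* Hypothetical: index of the first OS page backing a 0-based buffer. */
static size_t
sketch_first_os_page(size_t buffer_id, size_t blcksz, size_t os_page_size)
{
    assert(os_page_size > 0);
    return (buffer_id * blcksz) / os_page_size;
}

/*
 * Hypothetical: how many OS pages one block covers.  When the block is
 * smaller than the OS page (e.g. 8kB blocks on 2MB huge pages), several
 * blocks share one page, so a single page is reported.
 */
static size_t
sketch_pages_per_block(size_t blcksz, size_t os_page_size)
{
    assert(os_page_size > 0);
    return blcksz >= os_page_size ? blcksz / os_page_size : 1;
}
```

With 16384 buffers of 8kB on 4kB pages this gives 32768 OS pages and 2 pages
per block, matching the os_page_count and pages_per_blk values in the NOTICE
output shown at the top of the thread. With 2MB huge pages, buffers 0..255
map to page 0 and buffer 256 maps to page 1.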
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 42 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 462 +++++++++++++-----
.../sql/pg_buffercache_numa.sql | 21 +
doc/src/sgml/pgbuffercache.sgml | 64 ++-
9 files changed, 493 insertions(+), 134 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..52f63aa258c
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Re-create the existing function and view, and register the new ones.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..9cc04320d63 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,12 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
-
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +43,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -61,84 +62,249 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine to map buffers to the OS page addresses that are
+ * later consumed by pg_numa_query_pages().
+ *
+ * Several buffers can share one OS page (when BLCKSZ is smaller
+ * than the OS page size); in that case only the first address
+ * is recorded and queried.
+ *
+ * To get reliable results we also need to touch the memory
+ * pages, so that the NUMA inquiry does not return -2 (-ENOENT,
+ * page not present).
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk, Size os_page_size,
+ void **os_page_ptrs, bool firstUseInBackend)
+{
+ size_t blk2page = (size_t)(buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer numbers are 1-based, hence buffer_id + 1 */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) + (os_page_size * j);
+
+ /* Touching the memory is needed only once per backend */
+ if (firstUseInBackend)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ }
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages.
+ *
+ * fcinfo is needed here, so it is passed in via PG_FUNCTION_ARGS.
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /*
+ * Return to original context when allocating transient memory
+ */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static void
+pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(i);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[i].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[i].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[i].blocknum = bufHdr->tag.blockNum;
+ fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ fctx->record[i].isdirty = true;
+ else
+ fctx->record[i].isdirty = false;
+
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[i].isvalid = true;
+ else
+ fctx->record[i].isvalid = false;
+
+ fctx->record[i].numa_zone_id = -1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static Datum
+get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[i].blocknum == InvalidBlockNumber ||
+ fctx->record[i].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[i].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[i].usagecount);
+ nulls[7] = false;
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
+ values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
+ nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
+ }
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
+
+/*
+ * When updating this routine, please keep it in sync with
+ * pg_buffercache_numa_pages() below.
+ */
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -149,36 +315,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_build_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -188,59 +325,122 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
+/*
+ * This is almost identical to the above, but performs a
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+ /*
+ * To get reliable results we need to "touch pages" once; see the
+ * comments near pg_buffercache_numa_prepare_ptrs().
+ */
+ static bool firstUseInBackend = true;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ int os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB, ..., 32kB) that end up being
+ * allocated in OS memory pages of various sizes, so first we need
+ * to determine the OS memory page size before calling
+ * move_pages().
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffer to OS page mapping code here.
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstUseInBackend)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
/*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
*/
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
+ for (i = 0; i < NBuffers; i++)
{
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
+ pg_buffercache_build_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
}
- else
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
{
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
+ int blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Technically we can get errors too here and pass that to
+ * user. Also we could somehow report single DB block spanning
+ * more than one NUMA zone, but it should be rare.
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
}
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
+ firstUseInBackend = false;
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..e2e8cd6a241
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,21 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..75978a6eaed 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,13 @@
convenient use.
</para>
+ <para>
+ The similar <function>pg_buffercache_numa_pages()</function> function is a slower
+ variant of the above that also provides the NUMA node ID for each shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +209,59 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are almost identical to those of the
+ <structname>pg_buffercache</structname> view, but this view includes one additional
+ column, <structfield>numa_zone_id</structfield>, as defined in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID (number) of the NUMA node for this particular buffer. NULL if the shared buffer
+ has not been used yet. On systems without NUMA this usually returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ This view is a clone of the original <structname>pg_buffercache</structname> view
+ that provides an additional <structfield>numa_zone_id</structfield> column. Fetching this
+ information from the OS is costly and can take much longer, so querying it from
+ automated or monitoring systems is not recommended.
+ </para>
+
+ <para>
+ As the NUMA node ID inquiry for each page requires memory pages to be paged in, the first
+ execution of this function can take a long time, especially on systems with large
+ <varname>shared_buffers</varname> and without <varname>huge_pages</varname> enabled.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
v10-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones.patch (application/octet-stream)
From 189be69e5a256bf966ab7f883f6635f377e0a6bc Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v10 3/3] Add pg_shmem_numa_allocations to show NUMA zones for
shared memory allocations
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 78 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 123 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 10 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 265 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..5164083131a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,79 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows the NUMA nodes
+ assigned to allocations made from the server's main shared memory segment.
+ This includes both memory allocated by <productname>PostgreSQL</productname>
+ itself and memory allocated by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of NUMA node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf5..cc014a62dc2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 4c9c3cb320f..61e603fc42a 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -579,3 +579,126 @@ pg_numa_available(PG_FUNCTION_ARGS)
PG_RETURN_BOOL(true);
}
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count,
+ max_zones;
+ /* To get reliable results we need to "touch pages" once */
+ static bool firstUseInBackend = true;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ return (Datum) 0;
+ }
+ max_zones = pg_numa_get_max_node();
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+ * various different OS memory pages sizes, so first we need to understand
+ * the OS memory page size before calling move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ if (firstUseInBackend)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+ Size *zones;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have a
+ * bug in our buffers to OS page mapping code here
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstUseInBackend)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ zones = palloc0(sizeof(Size) * (max_zones + 1));
+ /* Count number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ if (s >= 0)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= max_zones; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+
+ pfree(zones);
+ }
+
+ /*
+ * XXX: Compared to pg_get_shmem_allocations(), the NUMA version does not
+ * report the following regions: 1. shared memory allocated but not
+ * counted via the shmem index, 2. as-of-yet unused shared
+ * memory
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstUseInBackend = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 38612d8ae12..7fd68702be9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8493,6 +8493,14 @@
proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '5101', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index a76256405fe..38ff8bcabe8 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b2..b63c6e0f744 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_zone_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_zone_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 37b6d21e1f9..c07a4c7633a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..e748434c2fe
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,10 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index d195aaf1377..28261fd774b 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,11 +1921,13 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
On Wed, Mar 12, 2025 at 4:41 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for the new version!
v10 is attached with most fixes after review and one new thing
introduced: pg_numa_available() for run-time decision inside tests
which was needed after simplifying code a little bit as you wanted.
I've also fixed -Dlibnuma=disabled as it was not properly implemented.
There are a couple of open items (minor/decision things), but most is fixed
or answered:
Some random comments on 0001:
=== 1
It does not compile "alone". It's missing:
[..]
+extern PGDLLIMPORT int huge_pages_status;
[..]
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
That comes with 0002. So it has to be in 0001 instead.
Ugh, good catch, I haven't thought about it in isolation, they are
separate to just ease review, but should be committed together. Fixed.
=== 2
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
Maybe we should add the same test (checking for numa.h) for meson?
TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
between meson and autoconf?
=== 3
+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)
It looks to me that we can remove those 2 lines.
Done.
=== 4
+ Only supported on Linux.
s/on Linux/on platforms for which the libnuma library is implemented/?
I did a quick grep on "Only supported on" and it looks like that could be
a more consistent wording.
Fixed.
=== 5
+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>
I had a closer look at other header files and it looks like it "should" be:
#include "c.h"
#include "postgres.h"

#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif

#include "port/pg_numa.h"
#include "storage/pg_shmem.h"
And is "#include "c.h"" really needed?
Fixed both. It seems to compile without c.h.
=== 6
+/* FIXME not tested, might crash */
That's a bit scary.
When you are in support for long enough it becomes the norm ;) But
on a serious note, Andres wanted to have NUMA error/warning handlers (used
by libnuma), but the current code has no realistic code path to hit them
from numa_available(3), numa_move_pages(3) or numa_max_node(3). The
situation won't change until one day in the future (I hope!) we start
using some more advanced libnuma functionality for interleaving
memory, please see my earlier reply:
/messages/by-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA+s+_5hcTOg@mail.gmail.com
E.g. numa_available(3) is a tiny wrapper, see
https://github.com/numactl/numactl/blob/master/libnuma.c#L872
For now, I've adjusted that FIXME to XXX, but I still don't know how we
could inject a failure to see this triggered...
=== 7
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here
s/here/here./
Done.
Did not look carefully at all the comments in 0001, 0002 and 0003 though.
A few random comments regarding 0002:
=== 8
# create extension pg_buffercache;
ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages()
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE"
extension script file "pg_buffercache--1.2.sql", near line 7
While that's ok if 0003 is applied. I think that each individual patch should
"fully" work individually.
STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
0002 or 0001 + 0003, but here 0002 is complaining about lack of
pg_numa_query_pages() which is part of 0001 (?) Should I merge those
patches or keep them separate?
=== 9
+ do
+ {
+ if (os_page_ptrs[blk2page + j] == 0)
blk2page + j will be repeated multiple times, better to store it in a local
variable instead?
Done.
=== 10
+ if (firstUseInBackend == true)
if (firstUseInBackend) instead?
Done everywhere.
=== 11
+ int j = 0,
+ blk2page = (int) i * pages_per_blk;
I wonder if size_t is more appropriate for blk2page:
size_t blk2page = (size_t)(i * pages_per_blk) maybe?
Sure, done.
=== 12
as we know that we'll iterate until pages_per_blk then would a for loop be more
appropriate here, something like?
"
for (size_t j = 0; j < pages_per_blk; j++)
{
if (os_page_ptrs[blk2page + j] == 0)
{
"
Sure.
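The index arithmetic from items 11 and 12 can be tried outside PostgreSQL with a small standalone sketch. It is only an illustration under assumptions: the helper name prepare_block_page_ptrs and its buffers_base parameter are hypothetical, the buffer pool is assumed to start on an OS page boundary, and the real pg_buffercache_numa_prepare_ptrs() may compute the addresses differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not taken from the patch): for one buffer, record the
 * address of every OS page backing it. os_page_ptrs must be zero-initialized
 * and large enough for every OS page of the pool; buffers_base is assumed to
 * be aligned to os_page_size. Returns the number of newly filled slots.
 */
static size_t
prepare_block_page_ptrs(size_t buffer_id, size_t blcksz, size_t os_page_size,
                        char *buffers_base, void **os_page_ptrs)
{
    double pages_per_blk;
    size_t blk2page;
    size_t filled = 0;

    assert(blcksz > 0 && os_page_size > 0);

    /* 2.0 for 8kB blocks on 4kB pages, below 1.0 with huge pages */
    pages_per_blk = (double) blcksz / (double) os_page_size;
    /* index of the first OS page that holds this buffer */
    blk2page = (size_t) (buffer_id * pages_per_blk);

    for (size_t j = 0; j < pages_per_blk; j++)
    {
        size_t idx = blk2page + j;

        /* several buffers may share one OS page; fill each slot only once */
        if (os_page_ptrs[idx] == NULL)
        {
            os_page_ptrs[idx] = buffers_base + idx * os_page_size;
            filled++;
        }
    }
    return filled;
}
```

With 8kB blocks on 4kB pages every buffer fills two slots. With 2MB huge pages pages_per_blk is below 1, the loop still runs once, and 256 consecutive buffers map to the same slot, which is why the "already filled" check matters.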
=== 13
+ if (buf_state & BM_DIRTY)
+ fctx->record[i].isdirty = true;
+ else
+ fctx->record[i].isdirty = false;
fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?
=== 14
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[i].isvalid = true;
+ else
+ fctx->record[i].isvalid = false;
fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
maybe?
It is coming from the original pg_buffercache and it is less readable
to me, so I don't want to touch that too much because that might open
refactoring doors too wide.
=== 15
+populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
I wonder if "i" should get a more descriptive name?
Done, buffer_id.
=== 16
s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
with pg_stat_io_build_tuples for example).
I now realize that I did propose populate_buffercache_entry() up-thread,
sorry for changing my mind.
OK, I've tried to create a better name, but all my ideas were too
long, but pg_buffercache_build_tuple sounds nice, so let's use that.
=== 17
+ static bool firstUseInBackend = true;
maybe we should give it a more descriptive name?
I couldn't come up with anything that wouldn't look too long, so
instead I've added a comment explaining the meaning behind this
variable, hope that's good enough.
Also I need to think more about how firstUseInBackend is used, for example:
==== 17.1
would it be better to define it as a file scope variable? (at least that
would avoid to pass it as an extra parameter in some functions).
I have no strong opinion on this, but I have one doubt (for future):
isn't creating global variables making life harder for upcoming
multithreading guys ?
=== 17.2
what happens if firstUseInBackend is set to false and later on the pages
are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
called again by a backend that already had set firstUseInBackend to false,
would that provide accurate information?
It is still the correct result. That "touching" (paging-in) is only
necessary probably to properly resolve PTEs as the fork() does not
seem to carry them over from parent:
postgres=# select 'create table foo' || s || ' as select
generate_series(1, 100000);' from generate_series(1, 4) s;
postgres=# \gexec
SELECT 100000
SELECT 100000
SELECT 100000
SELECT 100000
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
group by numa_zone_id order by numa_zone_id; -- before:
numa_zone_id | count
--------------+---------
0 | 256
1 | 4131
| 8384221
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
1451
-- now use another root (!) session to run migratepages(8) from numactl, which
-- will also shift the shm segment:
# migratepages 1451 1 3
-- the above will be in progress for a long time, but the outcome
-- is visible much faster in that backend (pg is functioning):
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
group by numa_zone_id order by numa_zone_id;
numa_zone_id | count
--------------+---------
0 | 256
3 | 4480
| 8383872So the above clearly shows that initial touching of shm is required,
but just once and it stays valid afterwards.BTW: migratepages(8) was stuck for 1-2 minutes there on
"__wait_rcu_gp" according to it's wchan, without any sign of activity
on the OS and then out of blue completed just fine, s_b=64GB,HP=on.=== 18
+Datum +pg_buffercache_numa_pages(PG_FUNCTION_ARGS) +{ + FuncCallContext *funcctx; + BufferCachePagesContext *fctx; /* User function context. */ + bool query_numa = true;I don't think we need query_numa anymore in pg_buffercache_numa_pages().
I think that we can just ERROR (or NOTICE and return) here:
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;
and fully get rid of query_numa.
Right... fixed it here, but it made the tests blow up, so I've had to
find a way to conditionally launch tests based on NUMA availability
and that's how pg_numa_available() was born. It's in ipc/shmem.c
because I couldn't find a better place for it...
=== 19
And then here:
for (i = 0; i < NBuffers; i++)
{
populate_buffercache_entry(i, fctx);
if (query_numa)
pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
}

if (query_numa)
{
if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
elog(ERROR, "failed NUMA pages inquiry: %m");

for (i = 0; i < NBuffers; i++)
{
int blk2page = (int) i * pages_per_blk;

/*
* Technically we can get errors too here and pass that to
* user. Also we could somehow report single DB block spanning
* more than one NUMA zone, but it should be rare.
*/
fctx->record[i].numa_zone_id = os_pages_status[blk2page];
}
maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?
Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper
that prepares **os_page_ptrs, which is then used by
pg_numa_query_pages() and then we fill the data. But now after
removing query_numa, it reads smoother anyway. Can you please take a
look again at this, is this up to the project standards?
A few random comments regarding 0003:
=== 20
+ The <structname>pg_shmem_allocations</structname> view shows NUMA nodes
s/pg_shmem_allocations/pg_shmem_numa_allocations/?
Fixed.
=== 21
+ /* FIXME: should we release LWlock here ? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
There is no need, see src/backend/storage/lmgr/README.
Thanks, fixed.
=== 22
+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];
could we rely on pg_numa_get_max_node() instead?
Sure, done.
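When the per-node counters are sized from the highest node number instead of a fixed MAX_NUMA_ZONES, the array needs max_node + 1 slots, because node IDs start at 0. The standalone sketch below illustrates that together with a range check on each status value. It assumes that pg_numa_get_max_node() returns the highest node number in the same way numa_max_node(3) does; the function name count_pages_per_node and the skipped counter are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the patch): tally pages per NUMA node from a
 * move_pages(2)-style status array, where each entry is either a node ID or
 * a negative errno value. Returns a calloc'd array with max_node + 1
 * counters (caller frees it), or NULL if allocation fails. Entries that are
 * negative or above max_node are counted in *skipped instead.
 */
static size_t *
count_pages_per_node(const int *pages_status, size_t page_count,
                     int max_node, size_t *skipped)
{
    size_t *zones;

    assert(max_node >= 0);
    assert(skipped != NULL);

    /* node IDs run from 0 to max_node inclusive, hence the + 1 */
    zones = calloc((size_t) max_node + 1, sizeof(size_t));
    if (zones == NULL)
        return NULL;

    *skipped = 0;
    for (size_t i = 0; i < page_count; i++)
    {
        int s = pages_status[i];

        if (s >= 0 && s <= max_node)
            zones[s]++;
        else
            (*skipped)++;
    }
    return zones;
}
```

A node ID that never shows up in the status array simply keeps a zero counter, so a gap in node numbering would not break the indexing.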
=== 23
+ if (s >= 0)
+ zones[s]++;
should we also check that s is below a limit?
STILL OPEN QUESTION: I'm not sure it would give us value to report
e.g. -2 on per shm entry/per numa node entry, would it? If it would we
could somehow overallocate that array and remember negative ones too.
=== 24
Regarding how we make use of pg_numa_get_max_node(), are we sure there is
no possible holes? I mean could a system have node 0,1 and 3 but not 2?
I have never seen a hole in numbering as it is established during
boot, and the only way that could get close to adjusting it could be
making processor books (CPU and RAM together) offline. Even *if*
someone would be doing some preventive hardware maintenance, that
still wouldn't hurt, as we are just using the Id of the zone to
display it -- the max would be already higher. I mean, technically one
could use lsmem(1) (it mentions removable flag, after let's pages and
processes migrated away and from there) and then use chcpu(1) to
--disable CPUs on that zone (to prevent new ones coming there and
allocating new local memory there) and then offlining that memory
region via chmem(1) --disable. Even with all of that, that still
shouldn't cause issues for this code I think, because
`numa_max_node()` says it `returns the >> highest node number <<<
available on the current system.`Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
did a little in 0001 though)
That was an attempt to say "Thank You". OK, I've aligned it that way,
so you are still referenced in 0001. I wouldn't have found the motivation to
work on this if you hadn't responded to those emails ;)
There is one known issue that CI reported for the numa.out test, which I need
to get to the bottom of (it does not reproduce for me locally), inside
numa.out/sql:
SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations
WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
Hi, ok, so v11 should fix this last problem and make cfbot happy.
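For anyone reading the attached 0002, the mapping between DB blocks and OS memory pages that it relies on can be sketched in isolation. This is a standalone illustration, not code from the patch: the function names are hypothetical, it assumes the buffer pool starts on an OS page boundary, and it uses integer arithmetic rather than the float pages_per_blk used in the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Index of the first OS page backing the buffer with 0-based index
 * buffer_idx, assuming the buffer pool begins on an OS page boundary
 * (an assumption of this sketch).
 */
size_t
first_os_page_of_buffer(size_t buffer_idx, size_t blcksz, size_t os_page_size)
{
	assert(blcksz > 0 && os_page_size > 0);

	return (buffer_idx * blcksz) / os_page_size;
}

/*
 * Number of OS pages that one buffer overlaps: 1 when the OS page is at
 * least as large as the block, blcksz / os_page_size otherwise (given the
 * alignment assumption above and power-of-two sizes).
 */
size_t
os_pages_spanned(size_t buffer_idx, size_t blcksz, size_t os_page_size)
{
	size_t		first_byte = buffer_idx * blcksz;
	size_t		last_byte = first_byte + blcksz - 1;

	assert(blcksz > 0 && os_page_size > 0);

	return last_byte / os_page_size - first_byte / os_page_size + 1;
}
```

For example, with BLCKSZ=8kB on 4kB pages, buffer 3 starts at OS page 6 and spans 2 pages; with 2MB huge pages, buffers 0..255 all map to OS page 0 and buffer 256 is the first one on OS page 1.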
-J.
Attachments:
v11-0002-Extend-pg_buffercache-with-new-view-pg_buffercac.patch
From df7f25e69e7e39d564e78783dab4d4fa4287dc4d Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v11 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA zone for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 42 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 462 +++++++++++++-----
.../sql/pg_buffercache_numa.sql | 21 +
doc/src/sgml/pgbuffercache.sgml | 64 ++-
9 files changed, 493 insertions(+), 134 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..52f63aa258c
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..9cc04320d63 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,12 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
-
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +43,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -61,84 +62,249 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine to map Buffers into addresses that can be
+ * later consumed by pg_numa_query_pages()
+ *
+ * Many buffers can point to the same page (in case of
+ * BLCKSZ < 4kB), but we want to also query just first
+ * address.
+ *
+ * In order to get reliable results we also need to touch
+ * memory pages, so that inquiry about NUMA zone doesn't
+ * return -2.
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk, Size os_page_size,
+ void **os_page_ptrs, bool firstUseInBackend)
+{
+ size_t blk2page = (size_t)(buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* NBuffers count start really from 1 */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) + (os_page_size * j);
+
+ /* We just need to do it only once in backend */
+ if (firstUseInBackend)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ }
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages.
+ *
+ * We need fcinfo here and we pass it here with PG_FUNCTION_ARGS
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /*
+ * Return to original context when allocating transient memory
+ */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static void
+pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(i);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[i].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[i].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[i].blocknum = bufHdr->tag.blockNum;
+ fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ fctx->record[i].isdirty = true;
+ else
+ fctx->record[i].isdirty = false;
+
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[i].isvalid = true;
+ else
+ fctx->record[i].isvalid = false;
+
+ fctx->record[i].numa_zone_id = -1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static Datum
+get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[i].blocknum == InvalidBlockNumber ||
+ fctx->record[i].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[i].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[i].usagecount);
+ nulls[7] = false;
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
+ values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
+ nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
+ }
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
+
+/*
+ * When updating this routine please sync it with below one:
+ * pg_buffercache_numa_pages()
+ */
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -149,36 +315,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_build_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -188,59 +325,122 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+ /*
+ * To get reliable results we need to "touch pages" once, see
+ * comments nearby pg_buffercache_numa_prepare_ptrs().
+ */
+ static bool firstUseInBackend = true;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ int os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+ * allocated in various different OS memory pages sizes, so first
+ * we need to understand the OS memory page size before calling
+ * move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably
+ * have bug in our buffers to OS page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstUseInBackend)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
/*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
*/
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
+ for (i = 0; i < NBuffers; i++)
{
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
+ pg_buffercache_build_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
}
- else
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
{
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
+ int blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Technically we can get errors too here and pass that to
+ * user. Also we could somehow report single DB block spanning
+ * more than one NUMA zone, but it should be rare.
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
}
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
+ firstUseInBackend = false;
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..e2e8cd6a241
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,21 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..75978a6eaed 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,13 @@
convenient use.
</para>
+ <para>
+ The similar <function>pg_buffercache_numa_pages()</function> is a slower
+ variant of the above, but it can also provide the NUMA node ID for each shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +209,59 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are almost identical to the previous
+ <structname>pg_buffercache</structname> view, but this one includes one additional
+ column numa_zone_id as defined in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID (number) of the NUMA node for this particular buffer. NULL if the shared buffer
+ has not been used yet. On systems without NUMA this usually returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ This is a clone of the original pg_buffercache view, however it provides an
+ additional <structfield>numa_zone_id</structfield> column. Fetching this
+ information from the OS is costly and might take much longer, so querying it is not
+ recommended for automated or monitoring systems.
+ </para>
+
+ <para>
+ As NUMA node ID inquiry for each page requires memory pages to be paged-in, the first
+ execution of this function can take a long time, especially on systems with large
+ shared_buffers and without huge_pages enabled.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
v11-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones.patch
From 1161550310b14529d4a42d0ab2ab58f5116708d7 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v11 3/3] Add pg_shmem_numa_allocations to show NUMA zones for
shared memory allocations
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 78 +++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 122 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 10 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 264 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..5164083131a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,79 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows the NUMA nodes
+ assigned to allocations made from the server's main shared memory segment.
+ This includes both memory allocated by <productname>PostgreSQL</productname>
+ itself and memory allocated by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of NUMA node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_size</structfield> <type>int4</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf5..cc014a62dc2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 4c9c3cb320f..0dda3c99dd8 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -579,3 +579,125 @@ pg_numa_available(PG_FUNCTION_ARGS)
PG_RETURN_BOOL(true);
}
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count,
+ max_zones;
+ /* To get reliable results we need to "touch pages" once */
+ static bool firstUseInBackend = true;
+ Size *zones;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ return (Datum) 0;
+ }
+ max_zones = pg_numa_get_max_node();
+ zones = palloc(sizeof(Size) * (max_zones + 1));
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+ * various different OS memory pages sizes, so first we need to understand
+ * the OS memory page size before calling move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ if (firstUseInBackend)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstUseInBackend)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(zones, 0, sizeof(Size) * (max_zones + 1));
+ /* Count number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ if (s >= 0)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= max_zones; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * XXX: We are ignoring in NUMA version reporting of the following regions
+ * (compare to pg_get_shmem_allocations() case): 1. output shared memory
+ * allocated but not counted via the shmem index 2. output as-of-yet
+ * unused shared memory
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstUseInBackend = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 38612d8ae12..7fd68702be9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8493,6 +8493,14 @@
proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '5101', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index a76256405fe..38ff8bcabe8 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b2..b63c6e0f744 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_zone_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_zone_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 37b6d21e1f9..c07a4c7633a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..e748434c2fe
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,10 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index d195aaf1377..28261fd774b 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,11 +1921,13 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
Attachment: v11-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch (application/octet-stream)
From f77b7e1a680a7aad370cf2ac9233b9c17bd5a4a2 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v11 1/3] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper.
Other platforms can be added later.
libnuma is unavailable on 32-bit builds, so due to lack of i386 shared object,
we disable it there (it does not make sense anyway as i386 is very
memory-limited even with PAE)
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 12 ++-
configure | 87 ++++++++++++++++
configure.ac | 13 +++
doc/src/sgml/func.sgml | 13 +++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 21 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/storage/ipc/shmem.c | 11 ++
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 5 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 43 ++++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 151 ++++++++++++++++++++++++++++
18 files changed, 387 insertions(+), 5 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 5849cbb839a..7010dff7aef 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -445,8 +445,10 @@ task:
EOF
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install \
+ libnuma1 \
+ libnuma-dev
matrix:
# SPECIAL:
@@ -471,6 +473,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
@@ -519,6 +522,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -835,8 +839,8 @@ task:
folder: $CCACHE_DIR
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 93fddd69981..23c33dd9971 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 51dd8ad6571..071b98e6c9a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25061,6 +25061,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if <acronym>NUMA</acronym> support is available.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index e076cefa3b9..9f56205a1d7 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-libxml">
<term><option>--with-libxml</option></term>
<listitem>
@@ -2611,6 +2621,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 13c13748e5d..19500ebdfb2 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,25 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ else
+ libnuma = not_found_dep
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3187,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3823,6 +3843,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5ac..0bd4b2d7d32 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..4c9c3cb320f 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -568,3 +569,13 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL-level function returning whether NUMA support is available. */
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ if (pg_numa_init() == -1)
+ PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(true);
+}
+
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ad25cbb39c5..dd34c79f521 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 42e427f8fe8..38612d8ae12 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8489,6 +8489,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '5102', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
@@ -12477,3 +12481,4 @@
prosrc => 'gist_stratnum_common' },
]
+
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..d3ebe8b5bd8
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..a68a29d5414 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7ffbd4d88d2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..b9348caaca9
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,151 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/* XXX: not tested */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here.
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
--
2.39.5
Hi,
On Wed, Mar 12, 2025 at 04:41:15PM +0100, Jakub Wartak wrote:
On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for the new version!
v10 is attached with most fixes after review and one new thing
introduced: pg_numa_available() for run-time decision inside tests
which was needed after simplifying code a little bit as you wanted.
I've also fixed -Dlibnuma=disabled as it was not properly implemented.
There are couple open items (minor/decision things), but most is fixed
or answered:
Thanks for the new version!
=== 2
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
Maybe we should add the same test (checking for numa.h) for meson?
TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
between meson and autoconf?
Yeah, idea was to have both in sync.
=== 8
# create extension pg_buffercache;
ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages()
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE"
extension script file "pg_buffercache--1.2.sql", near line 7
While that's ok if 0003 is applied. I think that each individual patch should
"fully" work individually.
STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
0002 or 0001 + 0003, but here 0002 is complaining about lack of
pg_numa_query_pages() which is part of 0001 (?) Should I merge those
patches or keep them separate?
Applying 0001 + 0002 only does produce the issue above. I think that each
sub-patch once applied should pass make check-world, that means:
0001 : should pass
0001 + 0002 : should pass
0001 + 0002 + 0003 : should pass
Also I need to think more about how firstUseInBackend is used, for example:
==== 17.1
would it be better to define it as a file scope variable? (at least that
would avoid to pass it as an extra parameter in some functions).
I have no strong opinion on this, but I have one doubt (for future):
isn't creating global variables making life harder for upcoming
multithreading guys ?
That would be "just" one more. I think it's better to use it to avoid the "current"
code using "useless" function parameters.
=== 17.2
what happens if firstUseInBackend is set to false and later on the pages
are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
called again by a backend that already had set firstUseInBackend to false,
would that provide accurate information?
It is still the correct result. That "touching" (paging-in) is only
necessary probably to properly resolve PTEs as the fork() does not
seem to carry them over from parent:
So the above clearly shows that initial touching of shm is required,
but just once and it stays valid afterwards.
Great, thanks for the demo and the explanation!
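The "touch once per backend" idea discussed above can be illustrated outside PostgreSQL. The following is a minimal sketch, not the patch's pg_numa_touch_mem_if_required(): touch_pages() is a hypothetical helper that assumes base is page-aligned and simply reads one byte per OS page, so that each page is faulted into the calling process before a NUMA inquiry is made.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (an assumption for illustration, not from the patch):
 * read one byte from every OS page in [base, base + len) so that each page
 * is mapped into this process. Assumes base is page-aligned.
 * Returns the number of pages touched.
 */
size_t
touch_pages(const void *base, size_t len, size_t os_page_size)
{
    const volatile unsigned char *p = (const volatile unsigned char *) base;
    size_t      touched = 0;

    assert(os_page_size > 0);

    for (size_t off = 0; off < len; off += os_page_size)
    {
        (void) p[off];          /* volatile read forces a real memory access */
        touched++;
    }

    return touched;
}
```

In the patch the equivalent loop runs only while firstUseInBackend is true, which matches the observation that the touch is needed just once per backend.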
=== 19
Can you please take a look again on this
Sure, will do.
=== 23
+ if (s >= 0)
+ zones[s]++;
should we also check that s is below a limit?
STILL OPEN QUESTION: I'm not sure it would give us value to report
e.g. -2 on per shm entry/per numa node entry, would it? If it would we
could somehow overallocate that array and remember negative ones too.
I meant to say, ensure that it is below the max node number.
=== 24
Regarding how we make use of pg_numa_get_max_node(), are we sure there is
no possible holes? I mean could a system have node 0,1 and 3 but not 2?
I have never seen a hole in numbering as it is established during
boot, and the only way that could get close to adjusting it could be
making processor books (CPU and RAM together) offline.
Yeah probably.
Even with all of that, that still
shouldn't cause issues for this code I think, because
`numa_max_node()` says it `returns the >> highest node number <<<
available on the current system.`
Yeah, I agree, thanks for clearing my doubts on it.
Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
did a little in 0001 though)
That was an attempt to say "Thank You",
Thanks a lot! OTOH, I don't want to get credit for something that I did not do ;-)
I'll have a look at v11 soon.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Thu, Mar 13, 2025 at 02:15:14PM +0000, Bertrand Drouvot wrote:
=== 19
Can you please take a look again on this
Sure, will do.
I'll have a look at v11 soon.
About 0001:
=== 1
git am produces:
.git/rebase-apply/patch:378: new blank line at EOF.
+
.git/rebase-apply/patch:411: new blank line at EOF.
+
warning: 2 lines add whitespace errors.
=== 2
+ Returns true if a <acronym>NUMA</acronym> support is available.
What about "Returns true if the server has been compiled with NUMA support"?
=== 3
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ if(pg_numa_init() == -1)
+ PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(true);
+}
What about PG_RETURN_BOOL(pg_numa_init() != -1)?
Also I wonder if pg_numa.c would not be a better place for it.
=== 4
+{ oid => '5102', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
Not sure that 5102 is a good choice.
src/include/catalog/unused_oids is telling me:
Best practice is to start with a random choice in the range 8000-9999.
Suggested random unused OID: 9685 (23 consecutive OID(s) available starting here)
So maybe use 9685 instead?
=== 5
@@ -12477,3 +12481,4 @@
prosrc => 'gist_stratnum_common' },
]
+
garbage?
=== 6
Run pgindent as it looks like it's finding some things to do in src/backend/storage/ipc/shmem.c
and src/port/pg_numa.c.
=== 7
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
I think that's more usual to:
Size
pg_numa_get_pagesize(void)
{
Size os_page_size = sysconf(_SC_PAGESIZE);

if (huge_pages_status == HUGE_PAGES_ON)
GetHugePageSize(&os_page_size, NULL);

return os_page_size;
}
I think that makes sense to check huge_pages_status as you did.
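For readers without the PostgreSQL tree at hand, here is a minimal standalone sketch of the base-page lookup, assuming only POSIX sysconf(). The huge page branch (GetHugePageSize) is PostgreSQL-internal and is left out. pages_spanned() is a hypothetical helper that rounds up, which differs from the patch's truncating division with a minimum of one page.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Base OS page size via POSIX sysconf(); returns 0 if the query fails. */
size_t
base_page_size(void)
{
    long        sz = sysconf(_SC_PAGESIZE);

    return sz > 0 ? (size_t) sz : 0;
}

/*
 * Hypothetical helper (an assumption for illustration): number of OS pages
 * needed to cover len bytes, rounding up.
 */
size_t
pages_spanned(size_t len, size_t os_page_size)
{
    assert(os_page_size > 0);
    return (len + os_page_size - 1) / os_page_size;
}
```

With 2MB huge pages an 8kB allocation still occupies a single page, which is why the page size has to be known before building the address array for the NUMA inquiry.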
=== 8
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
I wonder if that would make sense to add a comment mentioning
github.com/numactl/numactl/blob/master/libnuma.c here.
I still need to look at 0002 and 0003.
Regards,
--
Bertrand Drouvot
On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
Thank you very much for the review! I'm answering both reviews in
one go, and the result is the attached v12. It seems everything should
be solved now:
=== 2
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
Maybe we should add the same test (checking for numa.h) for meson?
TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
between meson and autoconf?
Yeah, idea was to have both in sync.
Added.
=== 8
# create extension pg_buffercache;
ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages()
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE"
extension script file "pg_buffercache--1.2.sql", near line 7
While that's ok if 0003 is applied. I think that each individual patch should
"fully" work individually.
STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
0002 or 0001 + 0003, but here 0002 is complaining about lack of
pg_numa_query_pages() which is part of 0001 (?) Should I merge those
patches or keep them separate?
Applying 0001 + 0002 only does produce the issue above. I think that each
sub-patch once applied should pass make check-world, that means:
0001 : should pass
0001 + 0002 : should pass
0001 + 0002 + 0003 : should pass
OK, I've retested v11 for all three of them. It worked fine (I think
in v10 I've moved one function to 0001, but pg_numa_query_pages() as
per error message above was always in the 0001).
Also I need to think more about how firstUseInBackend is used, for example:
==== 17.1
would it be better to define it as a file scope variable? (at least that
would avoid to pass it as an extra parameter in some functions).
I have no strong opinion on this, but I have one doubt (for future):
isn't creating global variables making life harder for upcoming
multithreading guys ?
That would be "just" one more. I think it's better to use it to avoid the "current"
code using "useless" function parameters.
Done.
=== 23
+ if (s >= 0)
+ zones[s]++;
should we also check that s is below a limit?
STILL OPEN QUESTION: I'm not sure it would give us value to report
e.g. -2 on per shm entry/per numa node entry, would it? If it would we
could somehow overallocate that array and remember negative ones too.
I meant to say, ensure that it is below the max node number.
Done. I doubt the kernel would ever return a value higher than
numa_max_node(), but who knows. An additional guard for this array is
now in place.
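The guard discussed in this item can be sketched in isolation. This is a minimal illustration rather than the patch code: tally_pages_per_node() is a hypothetical helper, and the array is assumed to hold max_node + 1 counters so that any node id in the range 0..max_node is a valid index, even if some ids in between are unused.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption for illustration, not from the patch):
 * count pages per NUMA node. status[i] is the node reported for page i, or a
 * negative value on error. counts must have room for max_node + 1 entries.
 * Returns the number of pages skipped because their status was out of range.
 */
size_t
tally_pages_per_node(const int *status, size_t npages, int max_node,
                     size_t *counts)
{
    size_t      skipped = 0;

    assert(max_node >= 0);
    memset(counts, 0, sizeof(size_t) * ((size_t) max_node + 1));

    for (size_t i = 0; i < npages; i++)
    {
        int         s = status[i];

        /* Only index the array with a value inside [0, max_node] */
        if (s >= 0 && s <= max_node)
            counts[s]++;
        else
            skipped++;
    }

    return skipped;
}
```

Negative statuses (such as the -2 mentioned earlier in the thread) and unexpectedly large node ids are both skipped instead of being written outside the array.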
SECOND REVIEW//v11-0001 review
=== 1
git am produces:
.git/rebase-apply/patch:378: new blank line at EOF.
+
.git/rebase-apply/patch:411: new blank line at EOF.
+
warning: 2 lines add whitespace errors.
Should be gone, but in at least one case (0003/numa.out) we need a
blank line at EOF, because otherwise the expected-output comparison
fails (even though numa.sql itself has no blank line at EOF).
=== 2
+ Returns true if a <acronym>NUMA</acronym> support is available.
What about "Returns true if the server has been compiled with NUMA support"?
Done.
=== 3
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ if(pg_numa_init() == -1)
+ PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(true);
+}
What about PG_RETURN_BOOL(pg_numa_init() != -1)?
Also I wonder if pg_numa.c would not be a better place for it.
Both done.
=== 4
+{ oid => '5102', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
Not sure that 5102 is a good choice.
src/include/catalog/unused_oids is telling me:
Best practice is to start with a random choice in the range 8000-9999.
Suggested random unused OID: 9685 (23 consecutive OID(s) available starting here)
So maybe use 9685 instead?
That's because I had 5101 there earlier (for the pg_shm_numa_allocs
view), but I've now moved both (5102 in 0001 and 5101 in 0003) to 9685 and 9686.
=== 5
@@ -12477,3 +12481,4 @@
prosrc => 'gist_stratnum_common' },
]
+
garbage?
Yea, fixed.
=== 6
Run pgindent as it looks like it's finding some things to do in src/backend/storage/ipc/shmem.c
and src/port/pg_numa.c.
Fixed.
=== 7
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
I think that's more usual to:
Size
pg_numa_get_pagesize(void)
{
Size os_page_size = sysconf(_SC_PAGESIZE);

if (huge_pages_status == HUGE_PAGES_ON)
GetHugePageSize(&os_page_size, NULL);

return os_page_size;
}
I think that makes sense to check huge_pages_status as you did.
Done.
=== 8
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
I wonder if that would make sense to add a comment mentioning
github.com/numactl/numactl/blob/master/libnuma.c here.
Sure thing, added.
-J.
Attachments:
v12-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones.patch (application/octet-stream)
From 6068f8b5c4a0eb29d684e8865221801a0c682543 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v12 3/3] Add pg_shmem_numa_allocations to show NUMA zones for
shared memory allocations
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 78 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 125 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 266 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..5164083131a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,79 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows the NUMA nodes
+ to which allocations made from the server's main shared memory segment are assigned.
+ This includes both memory allocated by <productname>PostgreSQL</productname>
+ itself and memory allocated by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of NUMA node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf5..cc014a62dc2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..9331a5760f6 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstUseInBackend = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,125 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count,
+ max_zones;
+ Size *zones;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ return (Datum) 0;
+ }
+ max_zones = pg_numa_get_max_node();
+ zones = palloc(sizeof(Size) * (max_zones + 1));
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+ * various different OS memory pages sizes, so first we need to understand
+ * the OS memory page size before calling move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ if (firstUseInBackend)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffers to OS page mapping code here
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstUseInBackend)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(zones, 0, sizeof(Size) * (max_zones + 1));
+ /* Count number of NUMA zones used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_zones)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= max_zones; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * XXX: We are ignoring in NUMA version reporting of the following regions
+ * (compare to pg_get_shmem_allocations() case): 1. output shared memory
+ * allocated but not counted via the shmem index 2. output as-of-yet
+ * unused shared memory
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstUseInBackend = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 85902903653..55ff305a713 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8496,6 +8496,14 @@
proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 954f549555e..d9d62470cdc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b2..b63c6e0f744 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_zone_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_zone_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 37b6d21e1f9..c07a4c7633a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index b81694c24f2..f93d4829702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,11 +1921,13 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
v12-0002-Extend-pg_buffercache-with-new-view-pg_buffercac.patch (application/octet-stream)
From 78607bd84be0b9a448491bcb0a7d3c6b8a042d1c Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v12 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA zone for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 42 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 463 +++++++++++++-----
.../sql/pg_buffercache_numa.sql | 20 +
doc/src/sgml/pgbuffercache.sgml | 64 ++-
9 files changed, 493 insertions(+), 134 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..52f63aa258c
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..b27add81f0a 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,12 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
-
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +43,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -61,84 +62,255 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * To get reliable results we need to "touch pages" once; see the
+ * comments near pg_buffercache_numa_prepare_ptrs().
+ */
+static bool firstUseInBackend = true;
+
+/*
+ * Helper routine to map Buffers into addresses that can be
+ * later consumed by pg_numa_query_pages()
+ *
+ * Several buffers can share one OS page (when BLCKSZ is
+ * smaller than the OS page size), in which case we query
+ * only the first address within that page.
+ *
+ * To get reliable results we also need to touch the memory
+ * pages, so that the NUMA zone inquiry doesn't return -2
+ * (-ENOENT, page not present).
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk, Size os_page_size,
+ void **os_page_ptrs)
+{
+ size_t blk2page = (size_t)(buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer IDs are 1-based, hence buffer_id + 1 */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) + (os_page_size * j);
+
+ /* Touching the memory is needed only once per backend */
+ if (firstUseInBackend)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ }
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages.
+ *
+ * We need fcinfo here and we pass it here with PG_FUNCTION_ARGS
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /*
+ * Return to original context when allocating transient memory
+ */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static void
+pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(i);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[i].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[i].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[i].blocknum = bufHdr->tag.blockNum;
+ fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ fctx->record[i].isdirty = true;
+ else
+ fctx->record[i].isdirty = false;
+
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[i].isvalid = true;
+ else
+ fctx->record[i].isvalid = false;
+
+ fctx->record[i].numa_zone_id = -1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_(numa_)pages
+ */
+static Datum
+get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[i].blocknum == InvalidBlockNumber ||
+ fctx->record[i].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[i].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[i].usagecount);
+ nulls[7] = false;
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
+ nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ nulls[9] = false;
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+/*
+ * When updating this routine, keep it in sync with the one below:
+ * pg_buffercache_numa_pages()
+ */
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -149,36 +321,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_build_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -188,59 +331,117 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
+/*
+ * This is almost identical to the above, but additionally performs
+ * a NUMA inquiry about the memory mappings
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ int os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
+ * This is for gathering some NUMA statistics. We might be using
+ * various DB block sizes (4kB, 8kB, ..., 32kB) that end up being
+ * allocated in OS memory pages of various sizes, so first we need
+ * to understand the OS memory page size before calling
+ * move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If an entry still holds the 0xff fill pattern after the kernel inquiry,
+ * then we probably have a bug in our buffer to OS page mapping code
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstUseInBackend)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
*/
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
+ for (i = 0; i < NBuffers; i++)
{
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
+ pg_buffercache_build_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs);
}
- else
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
{
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
+ int blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Technically we can get errors too here and pass that to
+ * user. Also we could somehow report single DB block spanning
+ * more than one NUMA zone, but it should be rare.
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
}
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
+ firstUseInBackend = false;
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..75978a6eaed 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,13 @@
convenient use.
</para>
+ <para>
+ The similar <function>pg_buffercache_numa_pages()</function> function is a slower
+ variant of the above that also reports the NUMA node ID of each shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +209,59 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are almost identical to the previous
+ <structname>pg_buffercache</structname> view, but this one includes one additional
+ column numa_zone_id as defined in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID (number) of the NUMA node for this particular buffer. NULL if the shared buffer
+ has not been used yet. On systems without NUMA this is usually 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ This is a clone of the original pg_buffercache view that provides an
+ additional <structfield>numa_zone_id</structfield> column. Fetching this
+ information from the OS is costly and can take much longer, so querying it from
+ automated or monitoring systems is not recommended.
+ </para>
+
+ <para>
+ As the NUMA node ID inquiry requires each memory page to be paged in, the first
+ execution of this function can take a long time, especially on systems with large
+ shared_buffers and without huge_pages enabled.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
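
For readers following the block <-> OS page mapping in
pg_buffercache_numa_pages() above, here is a small standalone C sketch of
the arithmetic. It is not code from the patch: the helper names
(total_os_pages, first_os_page, os_pages_spanned, init_status) are made up
for illustration, it uses integer division where the patch uses a float
pages_per_blk, and it assumes the buffer pool starts on an OS page boundary.
It also shows what the memset(..., 0xff, ...) pre-fill of the status array
leaves in each int slot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of OS pages backing nbuffers blocks. */
size_t
total_os_pages(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
	assert(os_page_size > 0);
	return (nbuffers * blcksz) / os_page_size;
}

/*
 * Hypothetical helper: index of the first OS page that holds the buffer
 * with 0-based index buffer_idx, assuming buffers are laid out back to
 * back starting at an OS page boundary.
 */
size_t
first_os_page(size_t buffer_idx, size_t blcksz, size_t os_page_size)
{
	assert(os_page_size > 0);
	return (buffer_idx * blcksz) / os_page_size;
}

/*
 * Hypothetical helper: how many OS pages one block covers. With huge
 * pages a block is only a fraction of a page, so the answer is 1.
 */
size_t
os_pages_spanned(size_t blcksz, size_t os_page_size)
{
	assert(os_page_size > 0);
	return blcksz > os_page_size ? blcksz / os_page_size : 1;
}

/*
 * Pre-fill the status array with 0xff bytes, as the patch does. Every
 * int then has all bits set, which reads back as -1 on two's complement
 * platforms (not as 0xff).
 */
void
init_status(int *status, size_t count)
{
	memset(status, 0xff, sizeof(int) * count);
}
```

With 8kB blocks, 4kB OS pages and 16384 buffers, total_os_pages() gives
32768, which matches the os_page_count in the NOTICE output quoted earlier in
this thread. With 2MB huge pages, buffer index 300 maps to OS page 1 and each
block covers a single page, so many buffers end up pointing at the same
os_page_ptrs slot.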
v12-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch
From 0754817d5081f18c9d7edb9c95d412fb4dba7552 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v12 1/3] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper. Other platforms can be added later.
This also adds the function pg_numa_available(), which can be used to check
whether the server was linked with NUMA support.
libnuma is unavailable on 32-bit builds for lack of an i386 shared object, so
we disable it there (it would not make much sense on i386 anyway, as it is a
very memory-limited platform even with PAE).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 12 +-
configure | 87 ++++++++++++++
configure.ac | 13 +++
doc/src/sgml/func.sgml | 13 +++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 46 ++++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 168 ++++++++++++++++++++++++++++
17 files changed, 397 insertions(+), 5 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 5849cbb839a..7010dff7aef 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -445,8 +445,10 @@ task:
EOF
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install \
+ libnuma1 \
+ libnuma-dev
matrix:
# SPECIAL:
@@ -471,6 +473,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
@@ -519,6 +522,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -835,8 +839,8 @@ task:
folder: $CCACHE_DIR
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 93fddd69981..23c33dd9971 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1c3810e1a04..113588defdd 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25078,6 +25078,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index e076cefa3b9..9f56205a1d7 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-libxml">
<term><option>--with-libxml</option></term>
<listitem>
@@ -2611,6 +2621,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 13c13748e5d..4106c4b13f5 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3189,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3823,6 +3845,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5ac..0bd4b2d7d32 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 508970680d1..4255b47c29a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 890822eaf79..85902903653 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8492,6 +8492,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..986152e0942
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+#include "postgres.h"
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..a68a29d5414 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7ffbd4d88d2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..7d905ef31f5
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,168 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/*
+ * XXX: not really tested as there is no way to trigger this in our
+ * current usage of libnuma.
+ *
+ * The libnuma built-in code can be seen here:
+ * https://github.com/numactl/numactl/blob/master/libnuma.c
+ *
+ */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here.
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
--
2.39.5
Hi,
On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote:
On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
Thank you very much for the review! I'm answering both reviews in
one go and the result is the attached v12; it seems everything should
be solved now:
Thanks for v12!
I'll review 0001 and 0003 later, but want to share what I've done for 0002.
I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply
on top of v11 0002 (I just rebased it a bit so that it now applies on top of
v12 0002).
The main changes are:
=== 1
Some doc wording changes.
=== 2
Some comments, pgindent and friends changes.
=== 3
relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
- pinning_backends int4, numa_zone_id int4);
+ pinning_backends int4, zone_id int4);
I changed numa_zone_id to zone_id as we are already in "numa" related functions
and/or views.
=== 4
- CHECK_FOR_INTERRUPTS();
}
+
+ CHECK_FOR_INTERRUPTS();
I think that it's better to put the CFI outside of the if condition, so that it
can be called for every page.
=== 5
-pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
for clarity.
=== 6
-get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
for clarity.
=== 7
- int os_page_count = 0;
+ uint64 os_page_count = 0;
I think that's better.
=== 8
But then, here, we need to change its format to %lu and cast it to (unsigned long)
to make the CI CompilerWarnings task happy.
That's fine because we don't provide NUMA support on 32-bit builds anyway.
- elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
- os_page_count, os_page_size, pages_per_blk);
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
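As a side note on the format change above, here is a minimal standalone sketch (plain C99, not PostgreSQL code) of an alternative that avoids the cast by using the PRIu64 macro from <inttypes.h>. The helper name format_numa_debug() is hypothetical and exists only for illustration; the patch itself keeps %lu with the (unsigned long) cast.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): formats the same message as
 * the DEBUG1 line above into buf and returns snprintf()'s result.
 */
static int
format_numa_debug(char *buf, size_t buflen, uint64_t os_page_count,
                  size_t os_page_size, float pages_per_blk)
{
    /*
     * PRIu64 expands to the right conversion for uint64_t on the target,
     * so no cast is needed and the width of "long" does not matter.
     */
    return snprintf(buf, buflen,
                    "NUMA: os_page_count=%" PRIu64
                    " os_page_size=%zu pages_per_blk=%.2f",
                    os_page_count, os_page_size, (double) pages_per_blk);
}
```

Whether this is preferable to the cast is a style question for the project; the sketch only shows that both spellings produce the same text for values that fit in an unsigned long.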
=== 9
-static bool firstUseInBackend = true;
+static bool firstNumaTouch = true;
Looks better to me but still not 100% convinced by the name.
=== 10
static BufferCachePagesContext *
-pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
as PG_FUNCTION_ARGS is usually used for fmgr-compatible functions and there are
a lot of examples in the code that make use of "FunctionCallInfo" for non
fmgr-compatible functions.
and also:
=== 11
I don't like the fact that we iterate 2 times over NBuffers in
pg_buffercache_numa_pages().
But I'm having a hard time finding a better approach given the fact that
pg_numa_query_pages() needs all the pointers "prepared" before it can be called...
Those 2 loops are probably the best approach, unless someone has a better idea.
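To make the two-pass shape concrete, here is a minimal standalone sketch under simplifying assumptions: the block size is an exact multiple of the OS page size, and fake_query_pages() is a hypothetical stand-in for pg_numa_query_pages() that invents node numbers instead of asking the kernel. N_BUFFERS, BLOCK_SIZE, OS_PAGE_SIZE and collect_zone_ids() are illustrative names as well, not identifiers from the patch.

```c
#include <stddef.h>
#include <stdlib.h>

#define N_BUFFERS    4      /* stand-in for NBuffers */
#define BLOCK_SIZE   8192   /* stand-in for BLCKSZ */
#define OS_PAGE_SIZE 4096   /* assumed OS page size */

/*
 * Hypothetical stand-in for pg_numa_query_pages(): like move_pages(2) with
 * a NULL nodes argument, it needs the complete pointer array up front and
 * fills status[] in a single call. Here pages 0..3 pretend to live on
 * node 0 and the rest on node 1.
 */
static int
fake_query_pages(size_t count, void **pages, int *status)
{
    for (size_t i = 0; i < count; i++)
        status[i] = (pages[i] != NULL) ? (int) (i / 4) : -2;
    return 0;
}

/* Fill zone_ids[0..N_BUFFERS-1]; returns 0 on success, -1 on failure. */
static int
collect_zone_ids(char *buffers, int *zone_ids)
{
    size_t  pages_per_blk = BLOCK_SIZE / OS_PAGE_SIZE;
    size_t  page_count = N_BUFFERS * pages_per_blk;
    void  **ptrs = calloc(page_count, sizeof(void *));
    int    *status = malloc(page_count * sizeof(int));
    int     rc = -1;

    if (ptrs != NULL && status != NULL)
    {
        /* Pass 1: compute the address of every OS page behind every buffer. */
        for (size_t i = 0; i < N_BUFFERS; i++)
            for (size_t j = 0; j < pages_per_blk; j++)
                ptrs[i * pages_per_blk + j] =
                    buffers + i * BLOCK_SIZE + j * OS_PAGE_SIZE;

        /* One batched inquiry covering all pages. */
        rc = fake_query_pages(page_count, ptrs, status);

        /* Pass 2: record the node of the first OS page of each buffer. */
        if (rc == 0)
            for (size_t i = 0; i < N_BUFFERS; i++)
                zone_ids[i] = status[i * pages_per_blk];
    }

    free(ptrs);
    free(status);
    return rc;
}
```

The second pass cannot be folded into the first because no status entry exists until the batched call has returned, which is why the two loops look hard to avoid.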
=== 12
Upthread you asked "Can you please take a look again on this, is this up to the
project standards?"
Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined wrapper?
What do you think? The comments, doc and code changes are just proposals and are
fully open to discussion.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
on_top_of_v12_0002.patch (text/x-diff; charset=us-ascii)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
index 52f63aa258c..42a693aa4d4 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -28,7 +28,7 @@ CREATE OR REPLACE VIEW pg_buffercache_numa AS
SELECT P.* FROM pg_buffercache_numa_pages() AS P
(bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
- pinning_backends int4, numa_zone_id int4);
+ pinning_backends int4, zone_id int4);
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index b27add81f0a..c5cfa32fa07 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -67,56 +67,59 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-/*
- * To get reliable results we need to "touch pages" once, see
- * comments nearby pg_buffercache_numa_prepare_ptrs().
- */
-static bool firstUseInBackend = true;
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
/*
- * Helper routine to map Buffers into addresses that can be
- * later consumed by pg_numa_query_pages()
+ * Helper routine to map Buffers into addresses that are used by
+ * pg_numa_query_pages().
*
- * Many buffers can point to the same page (in case of
- * BLCKSZ < 4kB), but we want to also query just first
- * address.
+ * When database block size (BLCKSZ) is smaller than the OS page size (4kB),
+ * multiple database buffers will map to the same OS memory page. In this case,
+ * we only need to query the NUMA zone for the first memory address of each
+ * unique OS page rather than for every buffer.
*
- * In order to get reliable results we also need to touch
- * memory pages, so that inquiry about NUMA zone doesn't
- * return -2.
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA zone doesn't return -2 (which indicates unmapped/unallocated
+ * pages)
*/
static inline void
-pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk, Size os_page_size,
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
+ Size os_page_size,
void **os_page_ptrs)
{
- size_t blk2page = (size_t)(buffer_id * pages_per_blk);
+ size_t blk2page = (size_t) (buffer_id * pages_per_blk);
for (size_t j = 0; j < pages_per_blk; j++)
{
- size_t blk2pageoff = blk2page + j;
+ size_t blk2pageoff = blk2page + j;
+
if (os_page_ptrs[blk2pageoff] == 0)
{
volatile uint64 touch pg_attribute_unused();
- /* NBuffers count start really from 1 */
- os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) + (os_page_size * j);
+ /* NBuffers starts from 1 */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) +
+ (os_page_size * j);
- /* We just need to do it only once in backend */
- if (firstUseInBackend)
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
- CHECK_FOR_INTERRUPTS();
}
+
+ CHECK_FOR_INTERRUPTS();
}
}
/*
- * Helper routine for pg_buffercache_(numa_)pages.
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
- * We need fcinfo here and we pass it here with PG_FUNCTION_ARGS
+ * This is almost identical to pg_buffercache_numa_pages(), but this one performs
+ * memory mapping inquiries to display NUMA zone information for each buffer.
*/
static BufferCachePagesContext *
-pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
BufferCachePagesContext *fctx; /* User function context. */
MemoryContext oldcontext;
@@ -191,64 +194,68 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
}
/*
- * Helper routine for pg_buffercache_(numa_)pages
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Build buffer cache information for a single buffer.
*/
static void
-pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
{
BufferDesc *bufHdr;
uint32 buf_state;
- bufHdr = GetBufferDescriptor(i);
+ bufHdr = GetBufferDescriptor(record_id);
/* Lock each buffer header before inspecting. */
buf_state = LockBufHdr(bufHdr);
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+ fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
+ fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
+ fctx->record[record_id].isdirty = true;
else
- fctx->record[i].isdirty = false;
+ fctx->record[record_id].isdirty = false;
/*
* Note if the buffer is valid, and has storage created
*/
if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
+ fctx->record[record_id].isvalid = true;
else
- fctx->record[i].isvalid = false;
+ fctx->record[record_id].isvalid = false;
- fctx->record[i].numa_zone_id = -1;
+ fctx->record[record_id].numa_zone_id = -1;
UnlockBufHdr(bufHdr, buf_state);
}
/*
- * Helper routine for pg_buffercache_(numa_)pages
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
*/
static Datum
-get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
{
Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
HeapTuple tuple;
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
nulls[0] = false;
/*
* Set all fields except the bufferid to null if the buffer is unused or
* not valid.
*/
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
+ if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
+ fctx->record[record_id].isvalid == false)
{
nulls[1] = true;
nulls[2] = true;
@@ -266,27 +273,27 @@ get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
}
else
{
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
+ values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
+ values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
+ values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
+ values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
+ values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
+ values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
+ values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
nulls[7] = false;
/*
* unused for v1.0 callers, but the array is always long enough
*/
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
+ values[8] = Int32GetDatum(fctx->record[record_id].pinning_backends);
nulls[8] = false;
- values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ values[9] = Int32GetDatum(fctx->record[record_id].numa_zone_id);
nulls[9] = false;
}
@@ -295,10 +302,6 @@ get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
return HeapTupleGetDatum(tuple);
}
-/*
- * When updating this routine please sync it with below one:
- * pg_buffercache_numa_pages()
- */
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -359,39 +362,46 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
Size os_page_size = 0;
void **os_page_ptrs = NULL;
int *os_pages_status = NULL;
- int os_page_count = 0;
+ uint64 os_page_count = 0;
float pages_per_blk = 0;
funcctx = SRF_FIRSTCALL_INIT();
+
if (pg_numa_init() == -1)
- elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
- * This is for gathering some NUMA statistics. We might be using
- * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
- * allocated in various different OS memory pages sizes, so first
- * we need to understand the OS memory page size before calling
- * move_pages()
- */
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: - Determine the OS
+ * memory page size - Calculate how many OS pages are used by all
+ * buffer blocks - Calculate how many OS pages are contained within
+ * each database block
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * zone inquiry.
+ */
os_page_size = pg_numa_get_pagesize();
os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
pages_per_blk = (float) BLCKSZ / os_page_size;
- elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
- os_page_count, os_page_size, pages_per_blk);
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
os_page_ptrs = palloc(sizeof(void *) * os_page_count);
- os_pages_status = palloc(sizeof(int) * os_page_count);
+ os_pages_status = palloc(sizeof(uint64) * os_page_count);
memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
/*
- * If we ever get 0xff back from kernel inquiry, then we probably
- * have bug in our buffers to OS page mapping code here
- */
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here
+ */
memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
- if (firstUseInBackend)
+ if (firstNumaTouch)
elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
/*
@@ -405,7 +415,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
for (i = 0; i < NBuffers; i++)
{
pg_buffercache_build_tuple(i, fctx);
- pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+ os_page_ptrs);
}
if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
@@ -416,10 +427,15 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
int blk2page = (int) i * pages_per_blk;
/*
- * Technically we can get errors too here and pass that to
- * user. Also we could somehow report single DB block spanning
- * more than one NUMA zone, but it should be rare.
- */
+ * Set the NUMA zone ID for this buffer based on the first OS page
+ * it maps to.
+ *
+ * Note: We could check for errors in os_pages_status and report
+ * them. Also, a single DB block might span multiple NUMA zones if
+ * it crosses OS pages on zone boundaries, but we only record the
+ * zone of the first page. This is a simplification but should be
+ * sufficient for most analyses.
+ */
fctx->record[i].numa_zone_id = os_pages_status[blk2page];
}
}
@@ -439,7 +455,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
}
else
{
- firstUseInBackend = false;
+ firstNumaTouch = false;
SRF_RETURN_DONE(funcctx);
}
}
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 75978a6eaed..4b49bb2974a 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -45,8 +45,9 @@
</para>
<para>
- The similiar <function>pg_buffercache_numa_pages()</function> is a slower
- variant of the above, but also can provide NUMA node ID for shared buffer entry.
+ The <function>pg_buffercache_numa_pages()</function> provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
The <structname>pg_buffercache_numa</structname> view wraps the function for
convenient use.
</para>
@@ -213,13 +214,14 @@
<title>The <structname>pg_buffercache_numa</structname> View</title>
<para>
- The definitions of the columns exposed are almost identical to the previous
- <structname>pg_buffercache</structname> view, but this one includes one additional
- column numa_zone_id as defined in <xref linkend="pgbuffercache-numa-columns"/>.
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>zone_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
</para>
<table id="pgbuffercache-numa-columns">
- <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
<tgroup cols="1">
<thead>
<row>
@@ -235,11 +237,12 @@
<tbody>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>numa_zone_id</structfield> <type>integer</type>
+ <structfield>zone_id</structfield> <type>integer</type>
</para>
<para>
- ID (number) of the NUMA node for this particular buffer. NULL if the shared buffer
- has not been used yet.On systems without NUMA this usually returns 0.
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
</para></entry>
</row>
@@ -248,16 +251,10 @@
</table>
<para>
- This is clone version of the original pg_buffercache view, however it provides
- additional <structfield>numa_zone_id</structfield> column. Fetching this
- information from OS is costly and might take much longer and querying it is not
- recommended by automated or monitoring systems.
- </para>
-
- <para>
- As NUMA node ID inquiry for each page requires memory pages to be paged-in, first
- execution of this function can take long time especially on systems with bigint
- shared_buffers and without huge_pages enabled.
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all the cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
</para>
</sect2>
On Fri, Mar 14, 2025 at 1:08 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote:
On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
Thank you very much for the review! I'm answering both reviews in
one go and the result is the attached v12; it seems everything should
be solved now:
Thanks for v12!
I'll review 0001 and 0003 later, but want to share what I've done for 0002.
I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply
on top of v11 0002 (I just rebased it a bit so that it now applies on top of
v12 0002).
Hey Bertrand,
All LGTM (good ideas), so here's v13 attached with all of that applied
(rebased, tested). BTW: I'm sending it to keep cfbot happy, as it still tried to
apply that .patch (on my side it was a .patch, not a .txt).
=== 9
-static bool firstUseInBackend = true;
+static bool firstNumaTouch = true;
Looks better to me but still not 100% convinced by the name.
IMHO, Yes, it looks much better.
=== 10
static BufferCachePagesContext *
-pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
as PG_FUNCTION_ARGS is usually used for fmgr-compatible functions and there are
a lot of examples in the code that make use of "FunctionCallInfo" for non
fmgr-compatible functions.
Cool, thanks.
and also:
=== 11
I don't like the fact that we iterate 2 times over NBuffers in
pg_buffercache_numa_pages().
But I'm having a hard time finding a better approach given the fact that
pg_numa_query_pages() needs all the pointers "prepared" before it can be called...
Those 2 loops are probably the best approach, unless someone has a better idea.
IMHO, it doesn't hurt and I've also not been able to think of any better idea.
=== 12
Upthread you asked "Can you please take a look again on this, is this up to the
project standards?"
Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined wrapper?
Yes, this was for an earlier doubt regarding question "19" about
reviewing the code after removal of `query_numa` variable. This is the
same code for 2 loops, IMHO it is good now.
What do you think? The comments, doc and code changes are just proposals and are
fully open to discussion.
They are great, thank You!
-J.
Attachments:
v13-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch (application/x-patch)
From 1a55056446fff06e0441d8d05a9e84832dbdc821 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v13 1/3] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper. Other platforms can be added later.
This also adds function pg_numa_available() that can be used to check if
the server was linked with NUMA support.
libnuma is unavailable on 32-bit builds due to the lack of an i386 shared
object, so we disable it there (it would not make sense on i386 anyway, as it
is a very memory-limited platform even with PAE).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 12 +-
configure | 87 ++++++++++++++
configure.ac | 13 +++
doc/src/sgml/func.sgml | 13 +++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 46 ++++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 168 ++++++++++++++++++++++++++++
17 files changed, 397 insertions(+), 5 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 5849cbb839a..7010dff7aef 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -445,8 +445,10 @@ task:
EOF
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install \
+ libnuma1 \
+ libnuma-dev
matrix:
# SPECIAL:
@@ -471,6 +473,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
@@ -519,6 +522,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -835,8 +839,8 @@ task:
folder: $CCACHE_DIR
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 93fddd69981..23c33dd9971 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1c3810e1a04..113588defdd 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25078,6 +25078,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index e076cefa3b9..9f56205a1d7 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-libxml">
<term><option>--with-libxml</option></term>
<listitem>
@@ -2611,6 +2621,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 13c13748e5d..4106c4b13f5 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3189,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3823,6 +3845,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5ac..0bd4b2d7d32 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9c0b10ad4dc..c5e8ce06c97 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 890822eaf79..85902903653 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8492,6 +8492,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..986152e0942
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+#include "postgres.h"
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) (ptr)
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..a68a29d5414 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7ffbd4d88d2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..7d905ef31f5
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,168 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/*
+ * XXX: not really tested as there is no way to trigger this in our
+ * current usage of libnuma.
+ *
+ * The libnuma built-in code can be seen here:
+ * https://github.com/numactl/numactl/blob/master/libnuma.c
+ *
+ */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ /* Retry until the formatted message fits, as appendStringInfoVA() requires */
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&msg, needed);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here.
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
--
2.39.5
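A note for readers of patch 1/3: pg_numa_query_pages() passes a NULL nodes array to numa_move_pages(), so the call only reports where each page currently resides. Per move_pages(2), each status entry is then either a node number or a negated errno value, and -ENOENT (-2) means the page is not present, which is why the patch touches pages before the inquiry. Below is a minimal sketch of classifying one such entry. The enum and function names are hypothetical and not part of the patch.

```c
#include <errno.h>

/* Hypothetical classification of one move_pages(2) status entry. */
typedef enum
{
    PAGE_ON_NODE,       /* status is a NUMA node number in 0..max_node */
    PAGE_NOT_PRESENT,   /* -ENOENT: page has not been faulted in yet */
    PAGE_OTHER_ERROR    /* any other negated errno, or out-of-range value */
} page_status_kind;

static page_status_kind
classify_page_status(int status, int max_node)
{
    if (status >= 0 && status <= max_node)
        return PAGE_ON_NODE;
    if (status == -ENOENT)
        return PAGE_NOT_PRESENT;
    return PAGE_OTHER_ERROR;
}
```

A caller could use this to decide whether a page needs to be touched and queried again, rather than silently dropping it from the per-node totals.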
Attachment: v13-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones.patch (application/x-patch)
From 3110606f68cc40e02b8ab4670c66089be4e2e305 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v13 3/3] Add pg_shmem_numa_allocations to show NUMA zones for
shared memory allocations
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 78 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 125 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 266 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..5164083131a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,79 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows the NUMA nodes
+ to which allocations made from the server's main shared memory segment are assigned.
+ This includes both memory allocated by <productname>PostgreSQL</productname>
+ itself and memory allocated by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_zone_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of NUMA node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf5..cc014a62dc2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..9331a5760f6 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstUseInBackend = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,125 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ int shm_total_page_count,
+ shm_ent_page_count,
+ max_zones;
+ Size *zones;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ return (Datum) 0;
+ }
+ max_zones = pg_numa_get_max_node();
+ zones = palloc(sizeof(Size) * (max_zones + 1));
+
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB, ..., 32kB) that end up being allocated in
+ * OS memory pages of various sizes, so first we need to determine
+ * the OS memory page size before calling move_pages().
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
+
+ if (firstUseInBackend)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffers-to-OS-pages mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstUseInBackend)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(zones, 0, sizeof(Size) * (max_zones + 1));
+ /* Count the pages of this shared memory entry found on each NUMA zone */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_zones)
+ zones[s]++;
+ }
+
+ for (i = 0; i <= max_zones; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(zones[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * XXX: Unlike pg_get_shmem_allocations(), the NUMA version does not
+ * report the following regions: 1. shared memory that is allocated but
+ * not tracked via the shmem index, 2. shared memory that is as yet
+ * unused.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstUseInBackend = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 85902903653..55ff305a713 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8496,6 +8496,14 @@
proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_zone_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 954f549555e..d9d62470cdc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b2..b63c6e0f744 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_zone_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_zone_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 37b6d21e1f9..c07a4c7633a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index b81694c24f2..f93d4829702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,11 +1921,13 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
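The per-entry accounting in pg_get_shmem_numa_allocations() (patch 3/3) can be illustrated in isolation. The sketch below is not the patch's code. The patch divides allocated_size by the OS page size and uses at least one page, whereas this variant rounds the page count up and accounts for an unaligned start address. All names are hypothetical, and the page size is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of OS pages touched by the byte range [start, start + size).
 * Assumes page_size is a power of two and size > 0.
 */
static size_t
pages_spanned(uintptr_t start, size_t size, size_t page_size)
{
    uintptr_t mask = ~((uintptr_t) page_size - 1);
    uintptr_t first = start & mask;
    uintptr_t last = (start + size - 1) & mask;

    return (size_t) ((last - first) / page_size) + 1;
}

/*
 * Sum bytes per NUMA node from a move_pages(2)-style status array.
 * Entries outside 0..max_node (e.g. negated errno values) are skipped.
 * bytes_per_node must have room for max_node + 1 elements.
 */
static void
sum_bytes_per_node(const int *status, size_t npages, size_t page_size,
                   int max_node, size_t *bytes_per_node)
{
    for (int n = 0; n <= max_node; n++)
        bytes_per_node[n] = 0;

    for (size_t i = 0; i < npages; i++)
    {
        if (status[i] >= 0 && status[i] <= max_node)
            bytes_per_node[status[i]] += page_size;
    }
}
```

Whether the view should count partially covered pages this way is a design question for the thread. The sketch only shows how the result differs from plain integer division.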
Attachment: v13-0002-Extend-pg_buffercache-with-new-view-pg_buffercac.patch (application/x-patch)
From 661c36d0a98e572ad0d3d47174f273f5fa3943c4 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v13 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA zone for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 42 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 479 +++++++++++++-----
.../sql/pg_buffercache_numa.sql | 20 +
doc/src/sgml/pgbuffercache.sgml | 61 ++-
9 files changed, 506 insertions(+), 134 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..42a693aa4d4
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..c5cfa32fa07 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,12 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
-
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +43,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_zone_id;
} BufferCachePagesRec;
@@ -61,84 +62,258 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+/*
+ * Helper routine to map Buffers into addresses that are used by
+ * pg_numa_query_pages().
+ *
+ * When database block size (BLCKSZ) is smaller than the OS page size (4kB),
+ * multiple database buffers will map to the same OS memory page. In this case,
+ * we only need to query the NUMA zone for the first memory address of each
+ * unique OS page rather than for every buffer.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA zone doesn't return -2 (which indicates unmapped/unallocated
+ * pages).
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
+ Size os_page_size,
+ void **os_page_ptrs)
+{
+ size_t blk2page = (size_t) (buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer numbers are 1-based, while buffer_id here is 0-based */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) +
+ (os_page_size * j);
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Set up the cross-call function context: build the result tuple descriptor
+ * and allocate the per-buffer record array.
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_zone_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /*
+ * Return to original context when allocating transient memory
+ */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Build buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
+ fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ fctx->record[record_id].isdirty = true;
+ else
+ fctx->record[record_id].isdirty = false;
+
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[record_id].isvalid = true;
+ else
+ fctx->record[record_id].isvalid = false;
+
+ fctx->record[record_id].numa_zone_id = -1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
+ nulls[0] = false;
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
+ fctx->record[record_id].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
+ nulls[7] = false;
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(fctx->record[record_id].pinning_backends);
+ nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[record_id].numa_zone_id);
+ nulls[9] = false;
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -149,36 +324,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_build_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -188,59 +334,130 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: - Determine the OS
+ * memory page size - Calculate how many OS pages are used by all
+ * buffer blocks - Calculate how many OS pages are contained within
+ * each database block
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * zone inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+ memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
*/
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
+ for (i = 0; i < NBuffers; i++)
{
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
+ pg_buffercache_build_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+ os_page_ptrs);
}
- else
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
{
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
+ int blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Set the NUMA zone ID for this buffer based on the first OS page
+ * it maps to.
+ *
+ * Note: We could check for errors in os_pages_status and report
+ * them. Also, a single DB block might span multiple NUMA zones if
+ * it crosses OS pages on zone boundaries, but we only record the
+ * zone of the first page. This is a simplification but should be
+ * sufficient for most analyses.
+ */
+ fctx->record[i].numa_zone_id = os_pages_status[blk2page];
}
+ }
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ funcctx = SRF_PERCALL_SETUP();
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
+ firstNumaTouch = false;
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..4b49bb2974a 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>zone_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>zone_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
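The block-to-OS-page mapping that pg_buffercache_numa_pages() relies on can be sketched standalone as below. This is only an illustration and not code from the patch: the helper names os_page_count() and first_os_page() are hypothetical, and the sketch deliberately uses integer arithmetic instead of the patch's float pages_per_blk, which should give the same index for the power-of-two sizes discussed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: total number of OS pages backing nbuffers blocks
 * of blcksz bytes each (mirrors the os_page_count computation).
 */
uint64_t
os_page_count(uint64_t nbuffers, uint64_t blcksz, uint64_t os_page_size)
{
	return (nbuffers * blcksz) / os_page_size;
}

/*
 * Hypothetical helper: index of the first OS page that holds the block
 * with the given 0-based buffer_id. Integer division is an assumption
 * made for this sketch; the patch multiplies by a float ratio instead.
 */
uint64_t
first_os_page(uint64_t buffer_id, uint64_t blcksz, uint64_t os_page_size)
{
	return (buffer_id * blcksz) / os_page_size;
}
```

With shared_buffers of 16384 blocks, BLCKSZ=8kB and 4kB OS pages this yields 32768 OS pages, two per block. With 2MB huge pages, 256 consecutive 8kB blocks share a single OS page.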
Hi,
On Mon, Mar 17, 2025 at 08:28:46AM +0100, Jakub Wartak wrote:
On Fri, Mar 14, 2025 at 1:08 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply
on top of v11 0002 (I just rebased it a bit so that it now applies on top of
v12 0002).
Hey Bertrand,
all LGTM (good ideas), so here's v13 attached with all of that applied
(rebased, tested).
Thanks for v13!
Looking at 0003:
=== 1
+ <entry>NUMA mappings for shared memory allocations</entry>
s/NUMA mappings/NUMA node mappings/ maybe?
=== 2
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes
+ assigned allocations made from the server's main shared memory segment.
What about?
"
shows how shared memory allocations in the server's main shared memory segment
are distributed across NUMA nodes" ?
=== 3
+ <structfield>numa_zone_id</structfield> <type>int4</type>
s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in
0002.
BTW, I wonder if "node_id" would be better (to match the descriptions...).
If so, would also need to be done in 0002.
=== 4
+ ID of NUMA node
<acronym>NUMA</acronym> node ID? (to be consistent with 0002).
=== 5
+static bool firstUseInBackend = true;
Let's use firstNumaTouch to be consistent with 0002.
=== 6
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");;
There are two ";" and I think that we should use the same wording as in
pg_buffercache_numa_pages().
=== 7
What about using ERROR instead? (like in pg_buffercache_numa_pages())
=== 8
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+ * various different OS memory pages sizes, so first we need to understand
+ * the OS memory page size before calling move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling
pg_numa_get_pagesize()?
=== 9
+ max_zones = pg_numa_get_max_node();
I think we are mixing "zone" and "node". I think we should standardize on one
and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
vote for node, but zone is fine too if you prefer.
=== 10
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
What about?
"
Allocate memory for page pointers and status based on total shared memory size.
This simplified approach allocates enough space for all pages in shared memory
rather than calculating the exact requirements for each segment.
" instead?
=== 11
+ int shm_total_page_count,
+ shm_ent_page_count,
I think those 2 should be uint64.
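To illustrate why a plain int may be too narrow for these counters, here is a small standalone sketch. The helper names are hypothetical and rounding up to whole pages is an assumption made for the example, not necessarily what 0003 does.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of OS pages spanned by a shared memory
 * area of "size" bytes, rounded up to whole pages (assumption).
 */
uint64_t
shm_page_count(uint64_t size, uint64_t os_page_size)
{
	return (size + os_page_size - 1) / os_page_size;
}

/* Returns 1 if the page count would not fit into a plain int. */
int
page_count_overflows_int(uint64_t size, uint64_t os_page_size)
{
	return shm_page_count(size, os_page_size) > (uint64_t) INT_MAX;
}
```

With 4kB pages, a segment of 8 TiB already needs 2^31 pages, which is one more than INT_MAX on platforms with a 32-bit int.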
=== 12
+ /*
+ * XXX: We are ignoring in NUMA version reporting of the following regions
+ * (compare to pg_get_shmem_allocations() case): 1. output shared memory
+ * allocated but not counted via the shmem index 2. output as-of-yet
+ * unused shared memory
+ */
why XXX?
what about?
"
We are ignoring the following memory regions (as compared to
pg_get_shmem_allocations())....
=== 13
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
maybe we could use palloc0() here?
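As a rough standalone analogy (outside the backend, so using libc rather than palloc), the same simplification looks like replacing malloc() plus memset() with calloc(). The function names below are made up for the illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Two-step form: allocate, then zero (analogous to palloc() + memset()). */
void **
alloc_page_ptrs_two_step(size_t count)
{
	void	  **ptrs = malloc(sizeof(void *) * count);

	if (ptrs != NULL)
		memset(ptrs, 0, sizeof(void *) * count);
	return ptrs;
}

/* One-step form: calloc() returns zeroed memory (analogous to palloc0()). */
void **
alloc_page_ptrs_zeroed(size_t count)
{
	return calloc(count, sizeof(void *));
}
```

Both variants hand back an array whose entries all compare equal to NULL on the platforms we care about, so the later "== 0" checks on the pointer array behave the same.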
=== 14
and I realize that we could probably use it in 0002 for os_page_ptrs.
=== 15
I think there are still some multi-line comments that are missing a period. I
probably also missed some in 0002 during the previous review. I think that's
worth another check.
=== 16
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
maybe use the same wording as in 0002?
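For context on the -2 mentioned in that comment: per move_pages(2), each status entry is either a node ID (>= 0) or a negative errno, and -ENOENT (-2 on Linux) means the page is not present. A minimal sketch of interpreting one status entry could look like this; the enum and function names are hypothetical and not part of either patch.

```c
#include <errno.h>

typedef enum
{
	PAGE_ON_NODE,				/* status holds a NUMA node ID */
	PAGE_NOT_PRESENT,			/* -ENOENT: page was never faulted in */
	PAGE_OTHER_ERROR			/* any other negative errno value */
} PageStatusKind;

/* Classify a single entry of the status array filled in by move_pages(). */
PageStatusKind
classify_page_status(int status)
{
	if (status >= 0)
		return PAGE_ON_NODE;
	if (status == -ENOENT)
		return PAGE_NOT_PRESENT;
	return PAGE_OTHER_ERROR;
}
```

Touching every page once before the inquiry is what keeps entries out of the PAGE_NOT_PRESENT bucket.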
=== 17
The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
but (as for 0002) it looks like we can not avoid it (or at least I don't see
a way to avoid it).
I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
updated.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for v13!
Rebased, with fixes included in the attached v14 (it passes CI too):
Looking at 0003:
=== 1
+ <entry>NUMA mappings for shared memory allocations</entry>
s/NUMA mappings/NUMA node mappings/ maybe?
Done.
=== 2
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes
+ assigned allocations made from the server's main shared memory segment.
What about?
"
shows how shared memory allocations in the server's main shared memory segment
are distributed across NUMA nodes" ?
Done.
=== 3
+ <structfield>numa_zone_id</structfield> <type>int4</type>
s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in
0002.
BTW, I wonder if "node_id" would be better (to match the descriptions...).
If so, would also need to be done in 0002.
Somewhat duplicate, please see answer for #9
=== 4
+ ID of NUMA node
<acronym>NUMA</acronym> node ID? (to be consistent with 0002).
=== 5
+static bool firstUseInBackend = true;
Let's use firstNumaTouch to be consistent with 0002.
Done.
=== 6
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");;
There are 2 ";", and I think that we should use the same wording as in
pg_buffercache_numa_pages().
=== 7
What about using ERROR instead? (like in pg_buffercache_numa_pages())
Both are synced now.
=== 8
+ /*
+ * This is for gathering some NUMA statistics. We might be using various
+ * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+ * various different OS memory pages sizes, so first we need to understand
+ * the OS memory page size before calling move_pages()
+ */
+ os_page_size = pg_numa_get_pagesize();
Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling
pg_numa_get_pagesize()?
Done: I improved the style of the comment there and synced the
pg_buffercache one with the shmem.c one.
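To make the block-to-page mapping described in that comment concrete, here is a
minimal sketch of the arithmetic only. It is not code from the patch: the helper
names os_pages_per_block and os_pages_spanned are made up for illustration, and
the sizes are passed in as plain numbers instead of being read from BLCKSZ or
pg_numa_get_pagesize().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): ratio of OS pages per database
 * block. The result is fractional when the OS page (e.g. a 2MB huge page)
 * is larger than the block.
 */
double
os_pages_per_block(size_t blcksz, size_t os_page_size)
{
    return (double) blcksz / (double) os_page_size;
}

/*
 * Hypothetical helper (not in the patch): number of OS pages touched by
 * nblocks contiguous blocks starting at byte offset start_off, rounding
 * outwards to OS page boundaries.
 */
uint64_t
os_pages_spanned(uint64_t start_off, uint64_t nblocks,
                 size_t blcksz, size_t os_page_size)
{
    uint64_t first;
    uint64_t last;

    if (nblocks == 0)
        return 0;

    first = start_off / os_page_size;
    last = (start_off + nblocks * blcksz - 1) / os_page_size;
    return last - first + 1;
}
```

With 8kB blocks and 4kB OS pages the ratio is 2.0, and 16384 such blocks cover
32768 OS pages, which agrees with the NOTICE output shown in the first message
of this thread. With 2MB huge pages the ratio drops below 1, which is why a
per-block inquiry would ask about the same OS page many times.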
=== 9
+ max_zones = pg_numa_get_max_node();
I think we are mixing "zone" and "node". I think we should standardize on one
and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
vote for node, but zone is fine too if you prefer.
Given that numa(7) does not use the "zone" keyword at all, and that both
/proc/zoneinfo and /proc/pagetypeinfo show NUMA nodes being split
into zones, we can conclude that a "zone" is simply a subdivision of
a NUMA node's memory (a kernel-internal thing). Examples are ZONE_DMA,
ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just the node ID (without
any internal information about zones), so we should stay away from
using "zone" in the patch at all. My bad, it's a terminology error on
my side from the start: I probably saw "zone" in some command output
back when we had that workshop, started using it, and somehow it
propagated through the patchset up to this day... I've adjusted it all
and settled on the "numa_node_id" column name.
=== 10
+ /*
+ * Preallocate memory all at once without going into details which shared
+ * memory segment is the biggest (technically min s_b can be as low as
+ * 16xBLCKSZ)
+ */
What about?
"
Allocate memory for page pointers and status based on total shared memory size.
This simplified approach allocates enough space for all pages in shared memory
rather than calculating the exact requirements for each segment.
" instead?
Done.
=== 11
+ int shm_total_page_count,
+ shm_ent_page_count,
I think those 2 should be uint64.
Right...
=== 12
+ /*
+ * XXX: We are ignoring in NUMA version reporting of the following regions
+ * (compare to pg_get_shmem_allocations() case): 1. output shared memory
+ * allocated but not counted via the shmem index 2. output as-of-yet
+ * unused shared memory
+ */
why XXX?
what about?
"
We are ignoring the following memory regions (as compared to
pg_get_shmem_allocations())....
Fixed, it was apparently a leftover from when I was wondering whether we should
still report it.
=== 13
+ page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+ memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
maybe we could use palloc0() here?
Of course!++
=== 14
and I realize that we could probably use it in 0002 for os_page_ptrs.
Of course!++
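For readers less familiar with the idiom: palloc0() is the PostgreSQL allocator
variant that returns zero-filled memory, so it replaces the palloc() + memset()
pair in one call. The sketch below shows the same simplification outside the
backend, with calloc() used purely as a stand-in for palloc0(). The function
names are hypothetical and are not part of the patch.

```c
#include <stdlib.h>
#include <string.h>

/* Two-step form: allocate, then clear (mirrors palloc() + memset()). */
void **
alloc_page_ptrs_two_step(size_t page_count)
{
    void **ptrs = malloc(sizeof(void *) * page_count);

    if (ptrs != NULL)
        memset(ptrs, 0, sizeof(void *) * page_count);
    return ptrs;
}

/* One-step form: calloc() stands in for palloc0() in this sketch. */
void **
alloc_page_ptrs_zeroed(size_t page_count)
{
    return calloc(page_count, sizeof(void *));
}
```

One difference from the stand-in: palloc() and palloc0() report out-of-memory
through ereport(ERROR) instead of returning NULL, so the backend code needs no
NULL check.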
=== 15
I think there are still some multi-line comments that are missing a period. I
probably also missed some in 0002 during the previous review. I think that's
worth another check.
Please do such a check. I've tried pgindent on all .c files, but I'm
apparently blind to such issues. BTW, if the patch still contains anything
that pgindent would change, is that not caught by CI but caught by the
buildfarm?
=== 16
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
maybe use the same wording as in 0002?
But 0002 used:
"In order to get reliable results we also need to touch memory pages, so that
inquiry about NUMA zone doesn't return -2 (which indicates
unmapped/unallocated pages)"
or are you looking at something different?
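For context on the -2: move_pages(2) reports a per-page status that is either a
node number (>= 0) or a negated errno value, and -ENOENT (-2 on Linux) means the
page is not present, which is what a never-touched page yields. A small sketch
of how such a status could be interpreted follows. The enum and function names
are made up for illustration and do not appear in the patch.

```c
#include <errno.h>

/* Hypothetical classification of a single move_pages(2) status value. */
typedef enum PageStatusKind
{
    PAGE_ON_NODE,       /* status is a valid NUMA node number */
    PAGE_NOT_PRESENT,   /* -ENOENT: page not faulted in yet */
    PAGE_QUERY_ERROR    /* any other negated errno */
} PageStatusKind;

PageStatusKind
classify_page_status(int status)
{
    if (status >= 0)
        return PAGE_ON_NODE;
    if (status == -ENOENT)
        return PAGE_NOT_PRESENT;
    return PAGE_QUERY_ERROR;
}
```

Touching each page once before the inquiry is what moves pages out of the
"not present" case, so that the later tally only sees node numbers.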
=== 17
The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
but (as for 0002) it looks like we can not avoid it (or at least I don't see
a way to avoid it).
Hm, it's literally debug code. Why would we care so much if it is 2
loops rather than 1? (as stated earlier we need to pack ptrs and then
analyze it)
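To spell out why two passes are needed: the kernel inquiry takes the whole array
of page addresses in one call, so the addresses have to be collected first and
the returned statuses tallied afterwards. Below is a minimal, self-contained
sketch of that shape. It is not the patch code: fake_query_pages is a
hypothetical stand-in for pg_numa_query_pages() that pretends pages alternate
between nodes 0 and 1, and the other names are made up as well.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel inquiry: even pages on node 0, odd on node 1. */
int
fake_query_pages(size_t count, void **pages, int *status)
{
    size_t i;

    (void) pages;               /* a real inquiry would inspect the addresses */
    for (i = 0; i < count; i++)
        status[i] = (int) (i % 2);
    return 0;
}

/*
 * Fill bytes_per_node[0..max_node] for a region of page_count OS pages.
 * page_ptrs and status must have room for page_count entries.
 * Returns 0 on success, -1 if the inquiry fails.
 */
int
tally_bytes_per_node(char *start, size_t page_count, size_t os_page_size,
                     void **page_ptrs, int *status,
                     uint64_t *bytes_per_node, int max_node)
{
    size_t i;

    /* Pass 1: pack the address of every OS page into one array. */
    for (i = 0; i < page_count; i++)
        page_ptrs[i] = start + i * os_page_size;

    /* One inquiry for all pages at once. */
    if (fake_query_pages(page_count, page_ptrs, status) != 0)
        return -1;

    /* Pass 2: analyze the statuses, skipping anything out of range. */
    memset(bytes_per_node, 0, sizeof(uint64_t) * (size_t) (max_node + 1));
    for (i = 0; i < page_count; i++)
    {
        if (status[i] >= 0 && status[i] <= max_node)
            bytes_per_node[status[i]] += os_page_size;
    }
    return 0;
}
```

Merging the two loops would mean one inquiry per page, which is the per-page
syscall pattern that was dropped earlier in this thread in favor of a single
batched call.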
I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
updated.
Cool, thanks in advance.
-J.
Attachments:
v14-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-memor.patch (application/octet-stream)
From ddde9c9eb099b81892bdfe243d68870402eb8f4f Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v14 3/3] Add pg_shmem_numa_allocations to show NUMA memory
node for shared memory allocations.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 129 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 271 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..310ee861dd8 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf5..cc014a62dc2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..9094d39ba3c 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,129 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: - Determine the OS memory
+ * page size - Calculate how many OS pages are used by all buffer blocks -
+ * Calculate how many OS pages are contained within each database block
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our shared memory to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA memory node doesn't return -2.
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count the OS pages on each NUMA node for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 85902903653..f7e7c90a886 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8496,6 +8496,14 @@
proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_node_id,numa_size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 954f549555e..d9d62470cdc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b2..0a0a989fcf6 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ numa_node_id,
+ numa_size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, numa_node_id, numa_size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index b81694c24f2..f93d4829702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,11 +1921,13 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
v14-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch (application/octet-stream)
From 75f4af43e74320058af914fe0ccc1b07f2e61cd6 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v14 1/3] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper. Other platforms can be added later.
This also adds function pg_numa_available() that can be used to check if
the server was linked with NUMA support.
libnuma is unavailable on 32-bit builds (there is no i386 shared object), so
we disable it there. It would not make much sense on i386 anyway, as that
platform is very memory-limited even with PAE.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 12 +-
configure | 87 ++++++++++++++
configure.ac | 13 +++
doc/src/sgml/func.sgml | 13 +++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 46 ++++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 168 ++++++++++++++++++++++++++++
17 files changed, 397 insertions(+), 5 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 5849cbb839a..7010dff7aef 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -445,8 +445,10 @@ task:
EOF
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install \
+ libnuma1 \
+ libnuma-dev
matrix:
# SPECIAL:
@@ -471,6 +473,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
@@ -519,6 +522,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -835,8 +839,8 @@ task:
folder: $CCACHE_DIR
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 93fddd69981..23c33dd9971 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1c3810e1a04..113588defdd 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25078,6 +25078,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index e076cefa3b9..9f56205a1d7 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-libxml">
<term><option>--with-libxml</option></term>
<listitem>
@@ -2611,6 +2621,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 13c13748e5d..4106c4b13f5 100644
--- a/meson.build
+++ b/meson.build
@@ -949,6 +949,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3168,6 +3189,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3823,6 +3845,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3b620bac5ac..0bd4b2d7d32 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -191,6 +191,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 0d3ebf06a95..37cda877f57 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -564,7 +564,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 890822eaf79..85902903653 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8492,6 +8492,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..986152e0942
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+#include "postgres.h"
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) (ptr)
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..a68a29d5414 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7ffbd4d88d2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..7d905ef31f5
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,168 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/*
+ * XXX: not really tested as there is no way to trigger this in our
+ * current usage of libnuma.
+ *
+ * The libnuma built-in code can be seen here:
+ * https://github.com/numactl/numactl/blob/master/libnuma.c
+ *
+ */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here.
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
--
2.39.5
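
Reviewer note (illustrative, not part of the patch): pg_numa_query_pages()
above passes NULL for the target nodes, so numa_move_pages() only reports
where each page currently lives. Per move_pages(2), each status entry is then
either a node number or a negative errno, with -ENOENT meaning the page is not
present. A minimal sketch of how a caller could tally such an array per node,
in the spirit of the earlier suggestion to do the grouping in C. The helper
name tally_pages_per_node and its signature are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: tally a move_pages(2)-style
 * status array into per-node page counts.
 *
 * per_node must have room for max_node + 1 counters. Entries equal to
 * -ENOENT (page not present) are counted in *not_present. The return value
 * is the number of entries that were neither a valid node nor -ENOENT.
 */
size_t
tally_pages_per_node(const int *status, size_t count,
                     size_t *per_node, int max_node,
                     size_t *not_present)
{
    size_t other_errors = 0;

    assert(status != NULL && per_node != NULL && not_present != NULL);
    assert(max_node >= 0);

    /* Start from clean counters. */
    for (int n = 0; n <= max_node; n++)
        per_node[n] = 0;
    *not_present = 0;

    for (size_t i = 0; i < count; i++)
    {
        if (status[i] >= 0 && status[i] <= max_node)
            per_node[status[i]]++;      /* page resides on this node */
        else if (status[i] == -ENOENT)
            (*not_present)++;           /* page was never faulted in */
        else
            other_errors++;             /* any other -errno, or bogus node */
    }

    return other_errors;
}
```

With status = {0, 1, 1, -ENOENT, -EFAULT, 0} and max_node = 1 this counts two
pages on each node, one not-present page, and returns 1 for the remaining
error entry.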
Attachment: v14-0002-Extend-pg_buffercache-with-new-view-pg_buffercac.patch
From 1684b2a2e8d89ec94e1669e8707d6ff6ebaff737 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 11:17:28 +0100
Subject: [PATCH v14 2/3] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA memory node for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 42 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 478 +++++++++++++-----
.../sql/pg_buffercache_numa.sql | 20 +
doc/src/sgml/pgbuffercache.sgml | 61 ++-
9 files changed, 505 insertions(+), 134 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
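
Reviewer note (illustrative, not part of the patch): the block-to-OS-page
index arithmetic used by pg_buffercache_numa_pages() below can be sketched in
standalone C. The function names are hypothetical, the sketch uses integer
arithmetic instead of the patch's float pages_per_blk, and it assumes the
buffer pool starts on an OS page boundary:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of whole OS pages covering nbuffers blocks of
 * blcksz bytes each.
 */
uint64_t
os_page_count(uint64_t nbuffers, size_t blcksz, size_t os_page_size)
{
    assert(os_page_size > 0);
    return (nbuffers * blcksz) / os_page_size;
}

/*
 * Hypothetical helper: index of the first OS page backing the 0-based
 * buffer_id. When blcksz < os_page_size several buffers share one index;
 * when blcksz > os_page_size a buffer spans several consecutive pages and
 * this returns the first of them.
 */
uint64_t
first_os_page(uint64_t buffer_id, size_t blcksz, size_t os_page_size)
{
    assert(os_page_size > 0);
    return (buffer_id * blcksz) / os_page_size;
}
```

For example, 16384 buffers of 8kB on 4kB pages give 32768 OS pages, matching
the NOTICE output shown earlier in the thread, and buffer 3 starts at OS page
6. With 2MB huge pages, buffers 0..255 all map to page 0 and buffer 256 is the
first one on page 1.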
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..4b5e864cb79
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Recreate pg_buffercache_pages() and register the new NUMA function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, numa_node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..27762db24ba 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,12 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
-
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -43,6 +43,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_node_id;
} BufferCachePagesRec;
@@ -61,84 +62,258 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+/*
+ * Helper routine to map buffers to the OS page addresses that are passed to
+ * pg_numa_query_pages().
+ *
+ * When the database block size (BLCKSZ) is smaller than the OS page size
+ * (e.g. 4kB, or larger with huge pages), multiple database buffers map to the
+ * same OS memory page. In this case, we only need to query the NUMA node for
+ * the first memory address of each unique OS page rather than for every buffer.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * the NUMA node inquiry doesn't return -2 (-ENOENT, which indicates a page
+ * that is not present, i.e. not yet faulted in).
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
+ Size os_page_size,
+ void **os_page_ptrs)
+{
+ size_t blk2page = (size_t) (buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer numbers are 1-based, while buffer_id is 0-based */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) +
+ (os_page_size * j);
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Set up the cross-call function context: build the result tuple descriptor
+ * and allocate one BufferCachePagesRec per shared buffer.
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "numa_node_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /*
+ * Return to original context when allocating transient memory
+ */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Build buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
+ fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ fctx->record[record_id].isdirty = true;
+ else
+ fctx->record[record_id].isdirty = false;
+
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[record_id].isvalid = true;
+ else
+ fctx->record[record_id].isvalid = false;
+
+ fctx->record[record_id].numa_node_id = -1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
+ nulls[0] = false;
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
+ fctx->record[record_id].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
+
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
+ nulls[7] = false;
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(fctx->record[record_id].pinning_backends);
+ nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[record_id].numa_node_id);
+ nulls[9] = false;
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -149,36 +324,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_build_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -188,59 +334,129 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
+/*
+ * This is almost identical to the above, but also performs a
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: - Determine the OS
+ * memory page size - Calculate how many OS pages are used by all
+ * buffer blocks - Calculate how many OS pages are contained within
+ * each database block
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here.
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
*/
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
+ for (i = 0; i < NBuffers; i++)
{
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
+ pg_buffercache_build_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+ os_page_ptrs);
}
- else
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
{
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
+ int blk2page = (int) (i * pages_per_blk);
+
+ /*
+ * Set the NUMA node id for this buffer based on the first OS page
+ * it maps to.
+ *
+ * Note: We could check for errors in os_pages_status and report
+ * them. Also, a single DB block might span multiple NUMA nodes if
+ * it crosses OS pages on node boundaries, but we only record the
+ * node of the first page. This is a simplification but should be
+ * sufficient for most analyses.
+ */
+ fctx->record[i].numa_node_id = os_pages_status[blk2page];
}
+ }
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ funcctx = SRF_PERCALL_SETUP();
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
+ firstNumaTouch = false;
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..8216b7cd93b 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>numa_node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all the cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
Hi,
On Tue, Mar 18, 2025 at 11:19:32AM +0100, Jakub Wartak wrote:
On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for v13!
Rebased and fixes inside in the attached v14 (it passes CI too):
Thanks!
=== 9
+ max_zones = pg_numa_get_max_node();
I think we are mixing "zone" and "node". I think we should standardize on one
and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
vote for node, but zone is fine too if you prefer.
Given that numa(7) does not use the "zone" keyword at all and both
/proc/zoneinfo and /proc/pagetypeinfo show that NUMA nodes are split
into zones, we can conclude that a "zone" is simply a subdivision within
a NUMA node's memory (an internal kernel thing). Examples are ZONE_DMA,
ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without
internal information about zones), therefore we should stay away from
using "zone" within the patch at all.
My bad, it's a terminology error on my side from the start:
I probably saw "zone" info in some command output back when we
had that workshop, started using it, and somehow it propagated
through the patchset up to this day...
Thanks for the explanation.
I've adjusted it all and settled on "numa_node_id" column name.
Yeah, I can see, things like:
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>numa_node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
and like:
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_size</structfield> <type>int4</type>
+ </para>
I think that you re-introduced the "numa_" prefix in the column name(s) that we
previously got rid of (or agreed to get rid of).
I think that we can drop the "numa_" prefix in the column name(s), as
the column(s) are part of the "numa" relation views/output anyway.
I think "node_id" and "size" should be enough as column names.
Or maybe re-adding "numa_" was intentional?
=== 15
I think there are still some multi-line comments that are missing a period. I
probably also missed some in 0002 during the previous review. I think that's
worth another check.
Please do such a check,
Found many more:
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages)
is missing the period
and
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
should be as before, meaning on current HEAD:
/* Switch context when allocating stuff to be used in later calls */
and
+ /*
+ * Return to original context when allocating transient memory
+ */
should be as before, meaning on current HEAD:
/* Return to original context when allocating transient memory */
and
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
should be as before, meaning on current HEAD:
/* Note if the buffer is valid, and has storage created */
and
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
should be as before, meaning on current HEAD:
/* unused for v1.0 callers, but the array is always long enough */
and
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inuqiry about memory mappings
is missing the period
and
+ * To correctly map between them, we need to: - Determine the OS
+ * memory page size - Calculate how many OS pages are used by all
+ * buffer blocks - Calculate how many OS pages are contained within
+ * each database block
is missing the period (2 times as this comment appears in 0002 and 0003)
and
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here
is missing the period (2 times as this comment appears in 0002 and 0003)
and
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory
is missing the period
I've tried pgindent on all .c files, but I'm
apparently blind to such issues.
I don't think pgindent would report a missing period.
BTW if patch has anything left that
causes pgindent to fix, that is not picked by CI but it is picked by
buildfarm??
I think it has to be done manually before each commit and that this is anyway
done at least once per release cycle.
=== 16
+ * In order to get reliable results we also need to touch memory
+ * pages so that inquiry about NUMA zone doesn't return -2.
+ */
maybe use the same wording as in 0002?
But 0002 used:
"In order to get reliable results we also need to touch memory pages, so that
inquiry about NUMA zone doesn't return -2 (which indicates
unmapped/unallocated pages)"
or are you looking at something different?
Nope, I meant to say that it could make sense to have the exact same comment.
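
For readers following along, the -2 mentioned in that comment can be illustrated with a small sketch. It assumes the per-page status convention documented for move_pages(2): a non-negative value is a node number and a negative value is a negated errno, with -ENOENT meaning the page is not present. The names page_state, classify_page_status() and touch_pages() are hypothetical and not part of the patch; the touch loop only shows the general idea of reading one byte per OS page so that the page gets faulted in before the inquiry.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical classification of one entry of a move_pages()-style status array. */
typedef enum
{
	PAGE_ON_NODE,				/* status >= 0: the value is the NUMA node id */
	PAGE_NOT_PRESENT,			/* -ENOENT: page was never faulted in */
	PAGE_QUERY_ERROR			/* any other negated errno */
} page_state;

static page_state
classify_page_status(int status)
{
	if (status >= 0)
		return PAGE_ON_NODE;
	if (status == -ENOENT)
		return PAGE_NOT_PRESENT;
	return PAGE_QUERY_ERROR;
}

/*
 * Read one byte from every OS page in [base, base + len) so that each page is
 * faulted in by this process. Returns the number of pages touched.
 */
static size_t
touch_pages(const char *base, size_t len, size_t os_page_size)
{
	size_t		touched = 0;

	for (size_t off = 0; off < len; off += os_page_size)
	{
		volatile char c = *(volatile const char *) (base + off);

		(void) c;
		touched++;
	}
	return touched;
}
```

With such a helper, a caller could report PAGE_NOT_PRESENT separately instead of exposing a raw -2 as if it were a node id.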
=== 17
The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
but (as for 0002) it looks like we can not avoid it (or at least I don't see
a way to avoid it).
Hm, it's literally debug code. Why would we care so much if it is 2
loops rather than 1? (as stated earlier we need to pack ptrs and then
analyze it)
Yeah, but if we could just loop one time I'm pretty sure we'd have done that.
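
To make the "pack ptrs and then analyze" point concrete, here is a minimal sketch of the two-pass shape, under simplifying assumptions: the block size is a whole multiple of the OS page size, and the batched query is passed in as a callback. numa_query_fn, fake_numa_query() and node_per_block() are hypothetical names for illustration only; fake_numa_query() merely stands in for a real batched inquiry such as pg_numa_query_pages() and returns made-up node ids.

```c
#include <stddef.h>

/* Hypothetical signature of a batched "which node is each page on?" query. */
typedef int (*numa_query_fn) (size_t count, void **pages, int *status);

/* Fake query for illustration: groups of 4 pages alternate between nodes 0 and 1. */
static int
fake_numa_query(size_t count, void **pages, int *status)
{
	(void) pages;
	for (size_t i = 0; i < count; i++)
		status[i] = (int) ((i / 4) % 2);
	return 0;
}

/*
 * Fill node_of_block[] with the node of the first OS page of each block.
 * Assumes blcksz >= os_page_size and blcksz % os_page_size == 0.
 */
static int
node_per_block(char *base, size_t nblocks, size_t blcksz, size_t os_page_size,
			   void **ptrs, int *status, int *node_of_block,
			   numa_query_fn query)
{
	size_t		pages_per_blk = blcksz / os_page_size;
	size_t		npages = nblocks * pages_per_blk;

	/* Pass 1: collect the address of every OS page backing the blocks. */
	for (size_t i = 0; i < nblocks; i++)
		for (size_t j = 0; j < pages_per_blk; j++)
			ptrs[i * pages_per_blk + j] = base + i * blcksz + j * os_page_size;

	/* A single batched query needs all addresses up front. */
	if (query(npages, ptrs, status) != 0)
		return -1;

	/* Pass 2: only now are the statuses available to be read back. */
	for (size_t i = 0; i < nblocks; i++)
		node_of_block[i] = status[i * pages_per_blk];

	return 0;
}
```

The second loop cannot be folded into the first because the statuses do not exist until the one batched call has returned; the alternative would be one query per block, which is the per-page syscall pattern the thread moved away from.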
I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
updated.
Cool, thanks in advance.
0001 looks in a good shape from my point of view.
For 0002:
=== 1
I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could
deserve their own patch. That would ease the review for the "real" numa stuff.
Maybe something like:
0001 as it is
0002 introduces (and uses) pg_buffercache_init_entries() and
pg_buffercache_build_tuple()
0003 current 0002 attached minus 0002 above
We did it that way in c2a50ac678e and ff7c40d7fd6 for example.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi! v15 attached, rebased, CI-tested, all fixes incorporated.
I've adjusted it all and settled on "numa_node_id" column name.
[...]
I think that we can get rid of the "numa_" stuff in column(s) name as
the column(s) are part of "numa" relation views/output anyway.
[...]
Done, you are probably right (it was probably done to keep consistency between
those two views). I'm just not that strongly attached to the naming of
things.
Please do such a check,
Found much more:
[.. 9 issues with missing dots at the end of sentences in comments +
fixes to comment structure in relation to HEAD..]
All fixed.
BTW if patch has anything left that
causes pgindent to fix, that is not picked by CI but it is picked by
buildfarm??
I think it has to be done manually before each commit and that this is anyway
done at least once per release cycle.
OK, thanks for clarification.
[..]
But 0002 used:
"In order to get reliable results we also need to touch memory pages, so that
inquiry about NUMA zone doesn't return -2 (which indicates
unmapped/unallocated pages)"
or are you looking at something different?
Nope, I meant to say that it could make sense to have the exact same comment.
Synced those two.
[..]
0001 looks in a good shape from my point of view.
Cool!
For 0002:
=== 1
I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could
deserve their own patch. That would ease the review for the "real" numa stuff.
Done, 0001+0002 alone passes the meson test.
-J.
Attachments:
v15-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch (application/octet-stream)
From ee17287bcab2178bfd473a7043ece3b54f498817 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v15 3/4] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA memory node for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 42 +++++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 172 +++++++++++++++++-
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 61 ++++++-
9 files changed, 329 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..0f4b2eaf444
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index f342005fd96..35b019206f5 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,11 +11,12 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -42,6 +43,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_node_id;
} BufferCachePagesRec;
@@ -60,10 +62,56 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+/*
+ * Helper routine to map Buffers into addresses that is used by
+ * pg_numa_query_pages().
+ *
+ * When database block size (BLCKSZ) is smaller than the OS page size (4kB),
+ * multiple database buffers will map to the same OS memory page. In this case,
+ * we only need to query the NUMA node for the first memory address of each
+ * unique OS page rather than for every buffer.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
+ Size os_page_size,
+ void **os_page_ptrs)
+{
+ size_t blk2page = (size_t) (buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* NBuffers starts from 1 */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) +
+ (os_page_size * j);
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+}
+
/*
* Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
@@ -121,6 +169,9 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "node_id",
+ INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -173,6 +224,8 @@ pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
else
fctx->record[record_id].isvalid = false;
+ fctx->record[record_id].numa_node_id = -1;
+
UnlockBufHdr(bufHdr, buf_state);
}
@@ -208,6 +261,7 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
/* unused for v1.0 callers, but the array is always long enough */
nulls[8] = true;
+ nulls[9] = true;
}
else
{
@@ -231,6 +285,8 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
*/
values[8] = Int32GetDatum(fctx->record[record_id].pinning_backends);
nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[record_id].numa_node_id);
+ nulls[9] = false;
}
/* Build and return the tuple. */
@@ -282,6 +338,120 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
}
}
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(uint64) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ *
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ pg_buffercache_build_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+ os_page_ptrs);
+ }
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Set the NUMA node id for this buffer based on the first OS page
+ * it maps to.
+ *
+ * Note: We could check for errors in os_pages_status and report
+ * them. Also, a single DB block might span multiple NUMA nodes if
+ * it crosses OS pages on node boundaries, but we only record the
+ * node of the first page. This is a simplification but should be
+ * sufficient for most analyses.
+ */
+ fctx->record[i].numa_node_id = os_pages_status[blk2page];
+ }
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ firstNumaTouch = false;
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..086e0062a17 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all the cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
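
The block-to-OS-page mapping described in the comments of the patch above can be sketched with plain integer arithmetic. Note that this is a swapped-in variant, not what the patch does: the patch keeps a float pages_per_blk, whereas the sketch below divides byte offsets directly, which also covers BLCKSZ smaller than the OS page size. It assumes the buffer pool starts on an OS page boundary; os_page_count(), first_os_page() and os_pages_spanned() are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of OS pages backing nbuffers blocks of blcksz bytes each. */
static uint64_t
os_page_count(uint64_t nbuffers, size_t blcksz, size_t os_page_size)
{
	return (nbuffers * blcksz) / os_page_size;
}

/* Index of the first OS page that contains buffer buffer_id (0-based). */
static uint64_t
first_os_page(uint64_t buffer_id, size_t blcksz, size_t os_page_size)
{
	return (buffer_id * blcksz) / os_page_size;
}

/* Number of OS pages a single buffer overlaps. */
static uint64_t
os_pages_spanned(uint64_t buffer_id, size_t blcksz, size_t os_page_size)
{
	uint64_t	first = (buffer_id * blcksz) / os_page_size;
	uint64_t	last = (buffer_id * blcksz + blcksz - 1) / os_page_size;

	return last - first + 1;
}
```

For example, 16384 buffers of 8kB on 4kB pages give 32768 OS pages with two pages per block, while with 2MB huge pages many consecutive blocks share one OS page and only need one inquiry between them.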
v15-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch (application/octet-stream)
From 244d6e9c95adf8857219bbfefefe76c07addde66 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v15 1/4] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper. Other platforms can be added later.
This also adds function pg_numa_available() that can be used to check if
the server was linked with NUMA support.
libnuma is unavailable on 32-bit builds due to the lack of an i386 shared object,
so we disable it there (it would not make much sense on i386 anyway, as it is a
very memory-limited platform even with PAE).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 12 +-
configure | 87 ++++++++++++++
configure.ac | 13 +++
doc/src/sgml/func.sgml | 13 +++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 46 ++++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 168 ++++++++++++++++++++++++++++
17 files changed, 397 insertions(+), 5 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 5849cbb839a..7010dff7aef 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -445,8 +445,10 @@ task:
EOF
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install \
+ libnuma1 \
+ libnuma-dev
matrix:
# SPECIAL:
@@ -471,6 +473,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
\
${LINUX_CONFIGURE_FEATURES} \
\
@@ -519,6 +522,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -835,8 +839,8 @@ task:
folder: $CCACHE_DIR
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 559f535f5cd..0931331f627 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
with_readline
with_systemd
@@ -868,6 +869,7 @@ with_libedit_preferred
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1581,6 +1583,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9140,6 +9143,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12378,6 +12408,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..1a394dfc077 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1041,6 +1041,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2ab5661602c..d7b33c67ec6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25078,6 +25078,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index e076cefa3b9..9f56205a1d7 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-libxml">
<term><option>--with-libxml</option></term>
<listitem>
@@ -2611,6 +2621,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index b6daa5b7040..b4cb6929bfc 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: libxml
@@ -3162,6 +3183,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
libxml,
lz4,
pam,
@@ -3817,6 +3839,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 702c4517145..adaadb5faf1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8fe9d61e82a..7ff45cf86e7 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index cc8f2b1230a..ae5452d9539 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 890822eaf79..85902903653 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8492,6 +8492,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..8894f800607 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -672,6 +672,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..986152e0942
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+#include "postgres.h"
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) (ptr)
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 60e13d50235..f786c191605 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS'
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..a68a29d5414 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7ffbd4d88d2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..7d905ef31f5
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,168 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/*
+ * XXX: not really tested as there is no way to trigger this in our
+ * current usage of libnuma.
+ *
+ * The libnuma built-in code can be seen here:
+ * https://github.com/numactl/numactl/blob/master/libnuma.c
+ *
+ */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here.
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
--
2.39.5
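
The comment on pg_numa_touch_mem_if_required() above says that pages must be faulted in before move_pages(2) reports valid results. A minimal sketch of that "touch every OS page once" step follows. It is plain C with no PostgreSQL or libnuma dependency. The names touch_mem() and touch_all_pages() are hypothetical stand-ins, not part of the patch, and the sketch reads a single byte rather than a uint64 to sidestep alignment concerns. The query itself (numa_move_pages() with a NULL nodes array) is not shown.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for pg_numa_touch_mem_if_required(): read one byte
 * through a volatile pointer so the compiler cannot drop the access.
 */
#define touch_mem(ro_volatile_var, ptr) \
	((ro_volatile_var) = *(volatile unsigned char *) (ptr))

/*
 * Touch one byte in every OS page of [start, start + len).
 * Returns the number of pages touched. os_page_size must be non-zero.
 */
static size_t
touch_all_pages(char *start, size_t len, size_t os_page_size)
{
	volatile unsigned char sink = 0;
	size_t		touched = 0;

	for (size_t off = 0; off < len; off += os_page_size)
	{
		touch_mem(sink, start + off);
		touched++;
	}

	(void) sink;
	return touched;
}
```

Under these assumptions, a caller would run touch_all_pages() over the shared memory range once per backend, then hand the same page addresses to pg_numa_query_pages().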
Attachment: v15-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch
From fbb7f669d18dfcc730e64287ab662965d03c460a Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v15 2/4] This extracts code from contrib/pg_buffercache's
primary function to separate functions.
This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
and get_buffercache_tuple() that help fill result tuplestores based on the
buffercache contents. This will be used in a follow-up commit that implements
NUMA observability in pg_buffercache.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
1 file changed, 178 insertions(+), 139 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3ae0a018e10..f342005fd96 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -14,7 +14,6 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
-
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
#define NUM_BUFFERCACHE_PAGES_ELEM 9
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
@@ -65,80 +64,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Set up the cross-call function context: check the expected result type,
+ * build the tuple descriptor and allocate NBuffers worth of records.
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Build buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
+ fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ fctx->record[record_id].isdirty = true;
+ else
+ fctx->record[record_id].isdirty = false;
+
+ /* Note if the buffer is valid, and has storage created */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[record_id].isvalid = true;
+ else
+ fctx->record[record_id].isvalid = false;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
+ nulls[0] = false;
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
+ fctx->record[record_id].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
+
+ /* unused for v1.0 callers, but the array is always long enough */
+ nulls[8] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
+ nulls[7] = false;
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(fctx->record[record_id].pinning_backends);
+ nulls[8] = false;
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -149,36 +260,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_build_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -188,59 +270,16 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
-
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
-
- /*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
- */
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
- {
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
- }
- else
- {
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
- }
-
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
--
2.39.5
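
The refactoring in the patch above splits the set-returning function into three stages: allocate the cross-call context, snapshot each entry, and format one entry per call. A toy sketch of that shape in plain C follows, assuming nothing from PostgreSQL. DemoRec, DemoContext and the demo_* functions are hypothetical, simplified stand-ins for BufferCachePagesRec, BufferCachePagesContext and the three helpers. Here a negative usage count is an invented marker for an invalid entry.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the pg_buffercache structs. */
typedef struct
{
	int			bufferid;
	bool		isvalid;
	int			usagecount;
} DemoRec;

typedef struct
{
	int			nrecords;
	DemoRec    *record;
} DemoContext;

/* Stage 1: allocate cross-call state (cf. pg_buffercache_init_entries). */
static DemoContext *
demo_init_entries(int nrecords)
{
	DemoContext *ctx = malloc(sizeof(DemoContext));

	if (ctx == NULL)
		return NULL;
	ctx->record = calloc(nrecords, sizeof(DemoRec));
	if (ctx->record == NULL)
	{
		free(ctx);
		return NULL;
	}
	ctx->nrecords = nrecords;
	return ctx;
}

/* Stage 2: snapshot one entry (cf. pg_buffercache_build_tuple). */
static void
demo_build_record(int i, DemoContext *ctx, const int *usage_src)
{
	ctx->record[i].bufferid = i + 1;
	ctx->record[i].isvalid = (usage_src[i] >= 0);
	ctx->record[i].usagecount = usage_src[i];
}

/* Stage 3: format one entry for output (cf. get_buffercache_tuple). */
static int
demo_format_record(int i, const DemoContext *ctx, char *out, size_t outlen)
{
	const DemoRec *r = &ctx->record[i];

	/* Invalid entries expose only their id, like the NULL columns above. */
	if (!r->isvalid)
		return snprintf(out, outlen, "%d|NULL", r->bufferid);
	return snprintf(out, outlen, "%d|%d", r->bufferid, r->usagecount);
}
```

Keeping the snapshot stage separate from the formatting stage is what would let a second caller, such as the NUMA variant mentioned in the commit message, reuse stages 1 and 2 and add its own per-entry data.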
Attachment: v15-0004-Add-pg_shmem_numa_allocations-to-show-NUMA-memor.patch
From f19a8151a9b436b2ab7c5d8b5ae1afe41eb49fb9 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v15 4/4] Add pg_shmem_numa_allocations to show NUMA memory
node for shared memory allocations.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 131 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 273 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..e83711f9578 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int4</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf5..cc014a62dc2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..a011603d318 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,131 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count number of NUMA nodes used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 85902903653..66d753733d5 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8496,6 +8496,14 @@
proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 954f549555e..d9d62470cdc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b2..f9b57e2a33f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index b81694c24f2..f93d4829702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,11 +1921,13 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
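For readers following the shmem.c hunk above, the per-entry accounting in pg_get_shmem_numa_allocations() can be sketched in standalone C. This is only an illustration under stated assumptions: count_bytes_per_node() and entry_page_count() are hypothetical names that do not exist in the patch, and the status array stands in for what a move_pages()-style query would fill in.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: given the per-page status
 * array filled in by a move_pages()-style query, accumulate how many bytes
 * of one allocation live on each NUMA node.
 *
 * status[i] >= 0 is a node id; negative values are errno-style codes
 * (for example -2 for a page that has not been faulted in) and are skipped.
 */
static void
count_bytes_per_node(const int *status, size_t npages, size_t os_page_size,
                     size_t *bytes_per_node, int max_node)
{
    memset(bytes_per_node, 0, sizeof(size_t) * ((size_t) max_node + 1));

    for (size_t i = 0; i < npages; i++)
    {
        int s = status[i];

        /* Only index the array with a valid node id */
        if (s >= 0 && s <= max_node)
            bytes_per_node[s] += os_page_size;
    }
}

/*
 * Same page-count rule as the hunk: truncating division, but never less
 * than one page per entry.
 */
static size_t
entry_page_count(size_t allocated_size, size_t os_page_size)
{
    size_t n = allocated_size / os_page_size;

    return n == 0 ? 1 : n;
}
```

Because the division truncates, an entry whose size is not a multiple of the OS page size has its trailing partial page left out of the per-node totals, and pages that report a negative status are not counted at all.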
Hi,
Thank you for working on this!
On Wed, 19 Mar 2025 at 12:06, Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi! v15 attached, rebased, CI-tested, all fixed incorporated.
This needs to be rebased after 8eadd5c73c.
--
Regards,
Nazir Bilal Yavuz
Microsoft
On Thu, Mar 27, 2025 at 12:31 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
Thank you for working on this!
On Wed, 19 Mar 2025 at 12:06, Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi! v15 attached, rebased, CI-tested, all fixed incorporated.
This needs to be rebased after 8eadd5c73c.
Hi Nazir, thanks for spotting! I've not connected the dots with AIO
going in and my libnuma dependency blowing up... Attached is rebased
v16 that passed my CI run.
-J.
Attachments:
v16-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch (application/octet-stream)
From c9d7598eeaa73741ea1658e88e51cec8858751a0 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v16 3/4] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA memory node for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 42 +++++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 172 +++++++++++++++++-
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 61 ++++++-
9 files changed, 329 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..0f4b2eaf444
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 86e0b8afe01..7abba723c0b 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,11 +11,12 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -45,6 +46,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_node_id;
} BufferCachePagesRec;
@@ -63,10 +65,56 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+/*
+ * Helper routine to map Buffers into addresses that are used by
+ * pg_numa_query_pages().
+ *
+ * When database block size (BLCKSZ) is smaller than the OS page size (4kB),
+ * multiple database buffers will map to the same OS memory page. In this case,
+ * we only need to query the NUMA node for the first memory address of each
+ * unique OS page rather than for every buffer.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
+ Size os_page_size,
+ void **os_page_ptrs)
+{
+ size_t blk2page = (size_t) (buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer numbers start from 1, while buffer_id is 0-based */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) +
+ (os_page_size * j);
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+}
+
/*
* Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
@@ -124,6 +172,9 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "node_id",
+ INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -176,6 +227,8 @@ pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
else
fctx->record[record_id].isvalid = false;
+ fctx->record[record_id].numa_node_id = -1;
+
UnlockBufHdr(bufHdr, buf_state);
}
@@ -211,6 +264,7 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
/* unused for v1.0 callers, but the array is always long enough */
nulls[8] = true;
+ nulls[9] = true;
}
else
{
@@ -234,6 +288,8 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
*/
values[8] = Int32GetDatum(fctx->record[record_id].pinning_backends);
nulls[8] = false;
+ values[9] = Int32GetDatum(fctx->record[record_id].numa_node_id);
+ nulls[9] = false;
}
/* Build and return the tuple. */
@@ -285,6 +341,120 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
}
}
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here.
+ *
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ pg_buffercache_build_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+ os_page_ptrs);
+ }
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Set the NUMA node id for this buffer based on the first OS page
+ * it maps to.
+ *
+ * Note: We could check for errors in os_pages_status and report
+ * them. Also, a single DB block might span multiple NUMA nodes if
+ * it crosses OS pages on node boundaries, but we only record the
+ * node of the first page. This is a simplification but should be
+ * sufficient for most analyses.
+ */
+ fctx->record[i].numa_node_id = os_pages_status[blk2page];
+ }
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ firstNumaTouch = false;
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..086e0062a17 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Because the <acronym>NUMA</acronym> node ID inquiry requires each memory page
+ to be paged in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly, and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
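The block-to-OS-page mapping that pg_buffercache_numa_pages() relies on can be illustrated with a small standalone sketch. Note the assumptions: first_os_page_of_buffer() and os_pages_touched_by_buffer() are hypothetical names, blcksz and os_page_size are plain parameters instead of BLCKSZ and pg_numa_get_pagesize(), the buffer pool is assumed to start on an OS page boundary, and the sketch uses integer arithmetic where the patch uses a float pages_per_blk.

```c
#include <stddef.h>

/*
 * Hypothetical standalone version of the index arithmetic: return the
 * index of the first OS page backing the given 0-based buffer.
 */
static size_t
first_os_page_of_buffer(size_t buffer_id, size_t blcksz, size_t os_page_size)
{
    /* Byte offset of the buffer within the pool, converted to a page index */
    return (buffer_id * blcksz) / os_page_size;
}

/*
 * Number of OS pages a single block covers: several small pages when the
 * block is larger than the page, otherwise a share of one (huge) page.
 */
static size_t
os_pages_touched_by_buffer(size_t blcksz, size_t os_page_size)
{
    return blcksz <= os_page_size ? 1 : blcksz / os_page_size;
}
```

With 8kB blocks and 4kB pages every buffer covers two OS pages, so buffer 3 starts at page 6. With 2MB huge pages, 256 consecutive 8kB buffers share one OS page, which is why only one address per unique OS page has to be queried.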
v16-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch (application/octet-stream)
From d091528cd990ea84980441bf82064d5b884c6786 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v16 1/4] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper. Other platforms can be added later.
This also adds the function pg_numa_available(), which can be used to check
whether the server was linked with NUMA support.
libnuma is unavailable on 32-bit builds because there is no i386 shared
object, so we disable it there. It would not make much sense on i386 anyway,
as it is a very memory-limited platform even with PAE.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 12 +-
configure | 87 ++++++++++++++
configure.ac | 13 +++
doc/src/sgml/func.sgml | 13 +++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 1 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 46 ++++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 168 ++++++++++++++++++++++++++++
17 files changed, 397 insertions(+), 5 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..e6963c774aa 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -445,8 +445,10 @@ task:
EOF
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install \
+ libnuma1 \
+ libnuma-dev
matrix:
# SPECIAL:
@@ -471,6 +473,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +526,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
@@ -839,8 +843,8 @@ task:
folder: $CCACHE_DIR
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install libnuma1 libnuma-dev
###
# Test that code can be built with gcc/clang without warnings
diff --git a/configure b/configure
index 4dd67a5cc6e..81e43a38331 100755
--- a/configure
+++ b/configure
@@ -711,6 +711,7 @@ with_libxml
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+with_libnuma
with_uuid
LIBURING_LIBS
LIBURING_CFLAGS
@@ -872,6 +873,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -1588,6 +1590,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -9279,6 +9282,33 @@ fi
+#
+# NUMA
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
@@ -12517,6 +12547,63 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi
+
+
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' does not provide numa_available" "$LINENO" 5
+fi
+
+fi
+
# XXX libcurl must link after libgssapi_krb5 on FreeBSD to avoid segfaults
# during gss_acquire_cred(). This is possibly related to Curl's Heimdal
# dependency on that platform?
diff --git a/configure.ac b/configure.ac
index 537e654e7b3..1879baf183a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,19 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5bf6656deca..1f98826d16d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25138,6 +25138,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..5f0486bb335 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma, which provides basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma, which provides basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 187f1787a3c..52c4f3c1022 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3177,6 +3198,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3833,6 +3855,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..8675e1b5d87 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index cce29a37ac5..71479ad9018 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 989825d3a9c..a80616d4455 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 8b68b16d79d..d532b8c43b9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8506,6 +8506,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c6f055b3905..424d42b14f8 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -675,6 +675,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..986152e0942
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+#include "postgres.h"
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux before pg_numa_query_pages(): the memory must
+ * be faulted in before the move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index 7843d7b67cb..8c8e6b92910 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 653539ba5b3..1eb8e38d047 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..7d905ef31f5
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,168 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux, thanks to libnuma, but in
+ * the future support for other platforms, e.g. Win32 or FreeBSD, might be
+ * possible too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
+
+#ifndef FRONTEND
+/*
+ * XXX: not really tested as there is no way to trigger this in our
+ * current usage of libnuma.
+ *
+ * The libnuma built-in code can be seen here:
+ * https://github.com/numactl/numactl/blob/master/libnuma.c
+ *
+ */
+void
+numa_warn(int num, char *fmt,...)
+{
+ va_list ap;
+ int olde = errno;
+ int needed;
+ StringInfoData msg;
+
+ initStringInfo(&msg);
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ if (needed > 0)
+ {
+ enlargeStringInfo(&msg, needed);
+ va_start(ap, fmt);
+ appendStringInfoVA(&msg, fmt, ap);
+ va_end(ap);
+ }
+
+ ereport(WARNING,
+ (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+ errmsg_internal("libnuma: WARNING: %s", msg.data)));
+
+ pfree(msg.data);
+
+ errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+ int olde = errno;
+
+ /*
+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here.
+ */
+ elog(WARNING, "libnuma: ERROR: %s", where);
+ errno = olde;
+}
+#endif /* FRONTEND */
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+ Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+ Size os_page_size;
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#endif
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+ return os_page_size;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
--
2.39.5
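Since pg_numa_get_pagesize() feeds the DB block to OS page mapping mentioned in the cover letter, a minimal sketch of that arithmetic may help review. The helper names below are hypothetical and not part of the patch; the sketch assumes the buffer array starts on an OS page boundary, which the real code would have to verify or compensate for.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the patch): number of OS pages backing
 * nbuffers database blocks of blcksz bytes, assuming the buffer array starts
 * on an OS page boundary.  Rounds up so that a partially used last page is
 * still counted.
 */
static size_t
os_pages_for_buffers(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
	size_t		total_bytes;

	assert(blcksz > 0 && os_page_size > 0);

	total_bytes = nbuffers * blcksz;
	return (total_bytes + os_page_size - 1) / os_page_size;
}

/*
 * Hypothetical helper: index of the first OS page that holds database block
 * number blkno, under the same alignment assumption.
 */
static size_t
first_os_page_of_block(size_t blkno, size_t blcksz, size_t os_page_size)
{
	assert(os_page_size > 0);

	return (blkno * blcksz) / os_page_size;
}
```

With 16384 buffers of 8kB and 4kB OS pages this gives 32768 pages to query, matching the os_page_count in the NOTICE shown in the cover letter; with 2MB huge pages the same buffer pool needs only 64 queries.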
Attachment: v16-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch (application/octet-stream)
From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
primary function to separate functions.
This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
and get_buffercache_tuple() that help fill result tuplestores based on the
buffercache contents. This will be used in a follow-up commit that implements
NUMA observability in pg_buffercache.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
1 file changed, 178 insertions(+), 139 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..86e0b8afe01 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -14,7 +14,6 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
-
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
#define NUM_BUFFERCACHE_PAGES_ELEM 9
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
@@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Allocates the cross-call function context, validates the caller's result
+ * row type, builds the tuple descriptor and allocates the per-buffer records.
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Build buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
+ fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ fctx->record[record_id].isdirty = true;
+ else
+ fctx->record[record_id].isdirty = false;
+
+ /* Note if the buffer is valid, and has storage created */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[record_id].isvalid = true;
+ else
+ fctx->record[record_id].isvalid = false;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
+ nulls[0] = false;
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
+ fctx->record[record_id].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
+
+ /* unused for v1.0 callers, but the array is always long enough */
+ nulls[8] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
+ nulls[7] = false;
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(fctx->record[record_id].pinning_backends);
+ nulls[8] = false;
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -152,36 +263,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_build_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -191,59 +273,16 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
-
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
-
- /*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
- */
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
- {
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
- }
- else
- {
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
- }
-
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
--
2.39.5
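Because this patch is meant as a pure extraction, the handling of version 1.0 callers should behave as in the removed lines: the row type must have 8 or 9 attributes, and pinning_backends is added only for 9. A minimal standalone model of that check follows. The function check_result_columns() is hypothetical and exists only for illustration; the two constants are copied from pg_buffercache_pages.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
#define NUM_BUFFERCACHE_PAGES_ELEM 9

/*
 * Hypothetical standalone model of the column-count check performed by the
 * pre-refactoring pg_buffercache_pages().  Returns false when natts is
 * outside the supported range (the real code raises an error instead).
 * Otherwise it reports whether the ninth column, pinning_backends, belongs
 * in the tuple descriptor.
 */
static bool
check_result_columns(int natts, bool *with_pinning_backends)
{
	assert(with_pinning_backends != NULL);

	if (natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
		natts > NUM_BUFFERCACHE_PAGES_ELEM)
		return false;

	/* Version 1.0 of the extension exposes only the first eight columns. */
	*with_pinning_backends = (natts == NUM_BUFFERCACHE_PAGES_ELEM);
	return true;
}
```

Any condition that also holds for an 8-column row type would initialize attribute 9 on a descriptor created with only 8 attributes, so the equality test matters here.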
Attachment: v16-0004-Add-pg_shmem_numa_allocations-to-show-NUMA-memor.patch (application/octet-stream)
From 0e7a1e8e34aa631d6420656eb5b811a0a07f11d7 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v16 4/4] Add pg_shmem_numa_allocations to show NUMA memory
node for shared memory allocations.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 131 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 273 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..e83711f9578 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 31d269b7ee0..660a9e9832b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..a011603d318 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,131 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count number of NUMA nodes used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d532b8c43b9..db69c0231f9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8510,6 +8510,14 @@
proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 954f549555e..d9d62470cdc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 47478969135..6e460e10d61 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index b81694c24f2..f93d4829702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,11 +1921,13 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
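For discussion, the per-entry page arithmetic in pg_get_shmem_numa_allocations() above can be sketched standalone. This is a minimal sketch and not code from the patch: align_down(), align_up() and pages_spanned() are hypothetical helpers, and the sketch assumes the OS page size is a power of two. It counts every OS page that a byte range touches, which can differ from allocated_size / os_page_size when an entry does not start on a page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of page_size (power of two assumed). */
static uintptr_t
align_down(uintptr_t addr, size_t page_size)
{
    return addr & ~((uintptr_t) page_size - 1);
}

/* Round addr up to a multiple of page_size (power of two assumed). */
static uintptr_t
align_up(uintptr_t addr, size_t page_size)
{
    return (addr + page_size - 1) & ~((uintptr_t) page_size - 1);
}

/*
 * Number of OS pages touched by the byte range [start, start + size).
 * Hypothetical helper for illustration only.
 */
static size_t
pages_spanned(uintptr_t start, size_t size, size_t page_size)
{
    uintptr_t   first;
    uintptr_t   end;

    assert(page_size > 0 && (page_size & (page_size - 1)) == 0);

    if (size == 0)
        return 0;

    first = align_down(start, page_size);
    end = align_up(start + size, page_size);

    return (size_t) ((end - first) / page_size);
}
```

With 4kB pages, a 4096-byte allocation starting at offset 100 touches two pages, while plain integer division of the size would report one.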
Hello
I think you should remove numa_warn() and numa_error() from 0001.
AFAICS they are dead code (even with all your patches applied), and
furthermore would get you in trouble regarding memory allocation because
src/port is not allowed to use palloc et al. If you wanted to keep them
you'd have to have them in src/common, but looking at the rest of the
code in that patch, ISTM src/port is the right place for it. If in the
future you discover that you do need numa_warn(), you can create a
src/common/ file for it then.
Is pg_buffercache really the best place for these NUMA introspection
routines? I'm not saying that it isn't, maybe we're okay with that
(particularly if we can avoid duplicated code), but it seems a bit weird
to me.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto. No es cierto, y si fuera cierto,
no me acuerdo." (Augusto Pinochet a una corte de justicia)
Hi,
On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install \
+   libnuma1 \
+   libnuma-dev
I think libnuma is installed on the relevant platforms, so you shouldn't need
to install it manually.
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
Most other dependencies say "build with libxyz ..."
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ *    Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "c.h"
+#include "postgres.h"
Headers should never include either of those headers. Nor should .c files
include both.
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS', + 'LIBURING_CFLAGS', 'LIBURING_LIBS', ]
Maybe I am missing something, but are you actually defining and using those
LIBNUMA_* vars anywhere?
+Size
+pg_numa_get_pagesize(void)
+{
+    Size os_page_size = sysconf(_SC_PAGESIZE);
+
+    if (huge_pages_status == HUGE_PAGES_ON)
+        GetHugePageSize(&os_page_size, NULL);
+
+    return os_page_size;
+}
Should this have a comment or an assertion that it can only be used after
shared memory startup? Because before that huge_pages_status won't be
meaningful?
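A minimal sketch of the assertion idea, assuming the status is still "unknown" until shared memory startup has run. The enum and function below are hypothetical stand-ins so that the example compiles on its own; they are not the patch's pg_numa_get_pagesize() or PostgreSQL's huge_pages_status.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the server's huge page status. */
typedef enum
{
    SKETCH_HUGE_PAGES_UNKNOWN,  /* shared memory not set up yet */
    SKETCH_HUGE_PAGES_OFF,
    SKETCH_HUGE_PAGES_ON
} SketchHugePagesStatus;

/*
 * Return the page size backing shared memory.
 *
 * Must only be called after shared memory startup, because the status is
 * not meaningful before that point.
 */
static size_t
sketch_numa_get_pagesize(SketchHugePagesStatus status,
                         size_t base_page_size,
                         size_t huge_page_size)
{
    assert(status != SKETCH_HUGE_PAGES_UNKNOWN);

    return (status == SKETCH_HUGE_PAGES_ON) ? huge_page_size : base_page_size;
}
```

The comment documents the precondition for readers, and the assertion catches early callers in assert-enabled builds.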
+#ifndef FRONTEND
+/*
+ * XXX: not really tested as there is no way to trigger this in our
+ * current usage of libnuma.
+ *
+ * The libnuma built-in code can be seen here:
+ * https://github.com/numactl/numactl/blob/master/libnuma.c
+ *
+ */
+void
+numa_warn(int num, char *fmt,...)
+{
+    va_list ap;
+    int olde = errno;
+    int needed;
+    StringInfoData msg;
+
+    initStringInfo(&msg);
+
+    va_start(ap, fmt);
+    needed = appendStringInfoVA(&msg, fmt, ap);
+    va_end(ap);
+    if (needed > 0)
+    {
+        enlargeStringInfo(&msg, needed);
+        va_start(ap, fmt);
+        appendStringInfoVA(&msg, fmt, ap);
+        va_end(ap);
+    }
+
+    ereport(WARNING,
+            (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+             errmsg_internal("libnuma: WARNING: %s", msg.data)));
I think you would at least have to hold interrupts across this, as
ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
out of libnuma in case an interrupt has arrived.
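The concern can be illustrated with a small standalone model. This is only a sketch under stated assumptions: the counter and functions below are hypothetical and merely imitate the behaviour of PostgreSQL's HOLD_INTERRUPTS() / RESUME_INTERRUPTS() macros and of the interrupt check that happens while a WARNING is reported; it is not backend code.

```c
#include <assert.h>
#include <stdbool.h>

static int  holdoff_count = 0;          /* models InterruptHoldoffCount */
static bool interrupt_pending = false;  /* models a pending cancel/die */
static int  interrupts_serviced = 0;

/* Models CHECK_FOR_INTERRUPTS(): acts only when interrupts are not held. */
static void
sketch_check_for_interrupts(void)
{
    if (interrupt_pending && holdoff_count == 0)
    {
        interrupt_pending = false;
        interrupts_serviced++;  /* real code could jump out of the caller here */
    }
}

/* Models reporting a WARNING, which checks for interrupts on the way out. */
static void
sketch_report_warning(const char *msg)
{
    (void) msg;
    sketch_check_for_interrupts();
}

/* Callback invoked from inside the library; it must return normally. */
static void
sketch_library_warn_callback(const char *msg)
{
    holdoff_count++;            /* like HOLD_INTERRUPTS() */
    sketch_report_warning(msg);
    holdoff_count--;            /* like RESUME_INTERRUPTS() */
}
```

In this model a pending interrupt stays pending while the callback runs and is serviced at the next check after the callback has returned to the library.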
+Size
+pg_numa_get_pagesize(void)
+{
+#ifndef WIN32
+    Size os_page_size = sysconf(_SC_PAGESIZE);
+#else
+    Size os_page_size;
+    SYSTEM_INFO sysinfo;
+
+    GetSystemInfo(&sysinfo);
+    os_page_size = sysinfo.dwPageSize;
+#endif
+    if (huge_pages_status == HUGE_PAGES_ON)
+        GetHugePageSize(&os_page_size, NULL);
+    return os_page_size;
+}
I would probably implement this once, outside of the big ifdef, with one more
ifdef inside, given that you're sharing the same implementation.
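One possible shape of that suggestion is a single function with the platform split confined to the base page size lookup. This is a sketch, not the patch: sketch_get_pagesize() and its parameters are hypothetical, and the huge page size is passed in rather than looked up via GetHugePageSize() so that the example builds on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

/*
 * Single implementation shared by all platforms; only the base page size
 * lookup differs. huge_pages_on and huge_page_size are illustrative inputs.
 */
static size_t
sketch_get_pagesize(bool huge_pages_on, size_t huge_page_size)
{
    size_t      os_page_size;

#ifdef _WIN32
    SYSTEM_INFO sysinfo;

    GetSystemInfo(&sysinfo);
    os_page_size = sysinfo.dwPageSize;
#else
    os_page_size = (size_t) sysconf(_SC_PAGESIZE);
#endif

    if (huge_pages_on)
        os_page_size = huge_page_size;

    return os_page_size;
}
```

Keeping the huge page handling outside the ifdef means it is written once and cannot drift between the two platform branches.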
From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
primary function to separate functions.

This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
and get_buffercache_tuple() that help fill result tuplestores based on the
buffercache contents. This will be used in a follow-up commit that implements
NUMA observability in pg_buffercache.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: /messages/by-id/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N=q1w+DiH-696Xw@mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
1 file changed, 178 insertions(+), 139 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..86e0b8afe01 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -14,7 +14,6 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
-
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
Independent change.
#define NUM_BUFFERCACHE_PAGES_ELEM 9
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
@@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * This is almost identical to pg_buffercache_numa_pages(), but this one performs
+ * memory mapping inquiries to display NUMA node information for each buffer.
+ */
If it's a helper routine it's probably not identical to
pg_buffercache_numa_pages()?
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Build buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
+{
This isn't really building a tuple? Seems somewhat confusing, because
get_buffercache_tuple() does actually build one.
+    BufferDesc *bufHdr;
+    uint32 buf_state;
+
+    bufHdr = GetBufferDescriptor(record_id);
+    /* Lock each buffer header before inspecting. */
+    buf_state = LockBufHdr(bufHdr);
+
+    fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
+    fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+    fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
+    fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
+    fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
+    fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
+    fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+    fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
As above, I think this would be more readable if you put
fctx->record[record_id] into a local var.
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+    Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+    bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+    HeapTuple tuple;
+
+    values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
+    nulls[0] = false;
+
+    /*
+     * Set all fields except the bufferid to null if the buffer is unused or
+     * not valid.
+     */
+    if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
+        fctx->record[record_id].isvalid == false)
+    {
+        nulls[1] = true;
+        nulls[2] = true;
+        nulls[3] = true;
+        nulls[4] = true;
+        nulls[5] = true;
+        nulls[6] = true;
+        nulls[7] = true;
+
+        /* unused for v1.0 callers, but the array is always long enough */
+        nulls[8] = true;
I'd probably just memset the entire nulls array to either true or false,
instead of doing it one-by-one.
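A minimal sketch of the memset approach, using a hypothetical fixed-width row rather than the real pg_buffercache structures. It assumes sizeof(bool) is 1, which the code asserts; SKETCH_NUM_COLS and sketch_fill_nulls() are names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_NUM_COLS 9   /* hypothetical column count */

/*
 * Initialize the nulls array in one call and then fix up the single column
 * that is always present (column 0, the buffer id in this sketch).
 */
static void
sketch_fill_nulls(bool nulls[SKETCH_NUM_COLS], bool buffer_is_valid)
{
    /* memset writes bytes, so this relies on bool being one byte wide */
    assert(sizeof(bool) == 1);

    memset(nulls, buffer_is_valid ? 0 : 1, sizeof(bool) * SKETCH_NUM_COLS);
    nulls[0] = false;
}
```

The size is spelled out from the column count because an array parameter decays to a pointer, so sizeof(nulls) inside the function would not give the array size.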
+    }
+    else
+    {
+        values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
+        nulls[1] = false;
+        values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
+        nulls[2] = false;
+        values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
+        nulls[3] = false;
+        values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
+        nulls[4] = false;
+        values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
+        nulls[5] = false;
+        values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
+        nulls[6] = false;
+        values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
+        nulls[7] = false;
Seems like it would end up a lot more readable if you put
fctx->record[record_id] into a local variable. Unfortunately that'd probably
be best done in one more commit ahead of the rest of this one...
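The local-variable suggestion could look roughly like the following. The structs here are simplified, hypothetical stand-ins for the pg_buffercache record and context types, kept small so that the sketch compiles standalone; buffer IDs are treated as 1-based in this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for the per-buffer record and context. */
typedef struct SketchRec
{
    uint32_t    bufferid;
    uint32_t    blocknum;
    uint16_t    usagecount;
    bool        isvalid;
} SketchRec;

typedef struct SketchContext
{
    SketchRec  *record;
} SketchContext;

/* Fill one record through a local pointer instead of repeating the lookup. */
static void
sketch_save_record(SketchContext *fctx, int record_id,
                   uint32_t blocknum, uint16_t usagecount)
{
    SketchRec  *rec = &fctx->record[record_id];

    rec->bufferid = (uint32_t) record_id + 1;
    rec->blocknum = blocknum;
    rec->usagecount = usagecount;
    rec->isvalid = true;
}
```

Each assignment then names only the field being set, which keeps the lines short and makes a mistyped index impossible.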
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
You could avoid the need for an alternative output file if you instead made
the queries do something like
SELECT NOT pg_numa_available() OR count(*) ...
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
I don't think we can just drop a view in the upgrade script. That will fail if
anybody created a view depending on pg_buffercache.
(Sorry, ran out of time / energy here, I had originally just wanted to comment
on the apt-get thing in the tests)
Greetings,
Andres Freund
On Thu, Mar 27, 2025 at 2:15 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Hello
Good morning :)
I think you should remove numa_warn() and numa_error() from 0001.
AFAICS they are dead code (even with all your patches applied), and
furthermore would get you in trouble regarding memory allocation because
src/port is not allowed to use palloc et al. If you wanted to keep them
you'd have to have them in src/common, but looking at the rest of the
code in that patch, ISTM src/port is the right place for it. If in the
future you discover that you do need numa_warn(), you can create a
src/common/ file for it then.
Understood, trimmed it out from the patch. I'm also going to respond
to Andres' review within minutes and post a new version
(v17) there.
Is pg_buffercache really the best place for these NUMA introspection
routines? I'm not saying that it isn't, maybe we're okay with that
(particularly if we can avoid duplicated code), but it seems a bit weird
to me.
I think it is, because as I understand it, Andres wanted to have
observability per single database *page*, and to avoid code duplication
we are just putting it there (it's a natural fit). Imagine looking up an
8kB root btree memory page being hit hard from CPUs on other NUMA
nodes (this just gives the ability to see that, but you could of course
also aggregate to get e.g. the NUMA node balance for a single relation
and so on).
-J.
On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
Hi Andres,
On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
setup_additional_packages_script: |
- #apt-get update
- #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+ apt-get update
+ DEBIAN_FRONTEND=noninteractive apt-get -y install \
+   libnuma1 \
+   libnuma-dev
I think libnuma is installed on the relevant platforms, so you shouldn't need
to install it manually.
Fixed. Right, you mentioned this earlier, I just didn't know when it went online.
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
Most other dependencies say "build with libxyz ..."
Done.
+ * pg_numa.h
[..]
+#include "c.h"
+#include "postgres.h"
Headers should never include either of those headers. Nor should .c files
include both.
Fixed. I've found the explanation here:
/messages/by-id/11634.1488932128@sss.pgh.pa.us
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS', + 'LIBURING_CFLAGS', 'LIBURING_LIBS', ]
Maybe I am missing something, but are you actually defining and using those
LIBNUMA_* vars anywhere?
OK, so it seems I was missing `PKG_CHECK_MODULES(LIBNUMA, numa)`
in configure.ac, which would set those *FLAGS. I'm a little bit lost on
how to guarantee that meson stays in sync with autoconf as per
project requirements. I'm trying to use past commits as a reference, but I
could still get something wrong here (especially in
src/Makefile.global.in).
+Size
+pg_numa_get_pagesize(void)
[..]
Should this have a comment or an assertion that it can only be used after
shared memory startup? Because before that huge_pages_status won't be
meaningful?
Added both.
+#ifndef FRONTEND
+/*
+ * XXX: not really tested as there is no way to trigger this in our
+ * current usage of libnuma.
+ *
+ * The libnuma built-in code can be seen here:
+ * https://github.com/numactl/numactl/blob/master/libnuma.c
+ *
+ */
+void
+numa_warn(int num, char *fmt,...)
[..]
I think you would at least have to hold interrupts across this, as
ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
out of libnuma in case an interrupt has arrived.
At Alvaro's request I've removed it, as that code is simply
unreachable/untestable, but point taken: I'm planning to re-add this
with interrupts held in the future, when we start using proper
numa_interleave() one day. Anyway, please let me know if you still
want to keep it as dead code. BTW, for context, why this is dead code
is explained in the latter part of message [1] /messages/by-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA+s+_5hcTOg@mail.gmail.com (TL;DR: unless we use
pinning/numa_interleave/local_alloc() we will probably never reach those
warning/error handlers).
"Another question without an easy answer as I never hit this error from
numa_move_pages(), one gets invalid stuff in *os_pages_status instead.
BUT!: most of our patch just uses things that cannot fail as per
libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700
/sys` (because it's hard to unmount it) and then still most of numactl
stuff works as euid != 0, but numactl --hardware gets at least
"libnuma: Warning: Cannot parse distance information in sysfs:
Permission denied" or same story with numactl -C 678 date. So unless
we start way more heavy use of libnuma (not just for observability)
there's like no point in that right now (?) Contrary to that: we can
just do a variadic elog() for that; I've put in some code, but have no idea
if that works fine..."
+Size
+pg_numa_get_pagesize(void)
[..]
I would probably implement this once, outside of the big ifdef, with one more
ifdef inside, given that you're sharing the same implementation.
Done.
From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
primary function to separate functions.

This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
and get_buffercache_tuple() that help fill result tuplestores based on the
buffercache contents. This will be used in a follow-up commit that implements
NUMA observability in pg_buffercache.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: /messages/by-id/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N=q1w+DiH-696Xw@mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
1 file changed, 178 insertions(+), 139 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..86e0b8afe01 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -14,7 +14,6 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
-
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
Independent change.
Fixed.
#define NUM_BUFFERCACHE_PAGES_ELEM 9
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
@@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * This is almost identical to pg_buffercache_numa_pages(), but this one performs
+ * memory mapping inquiries to display NUMA node information for each buffer.
+ */
If it's a helper routine it's probably not identical to
pg_buffercache_numa_pages()?
Of course not, fixed by removing that comment.
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Build buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
+{
This isn't really building a tuple? Seems somewhat confusing, because
get_buffercache_tuple() does actually build one.
s/pg_buffercache_build_tuple/pg_buffercache_save_tuple/g, unless
someone wants to come up with a better name.
+    BufferDesc *bufHdr;
+    uint32 buf_state;
+
+    bufHdr = GetBufferDescriptor(record_id);
+    /* Lock each buffer header before inspecting. */
+    buf_state = LockBufHdr(bufHdr);
+
+    fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
+    fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+    fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
+    fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
+    fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
+    fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
+    fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+    fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
As above, I think this would be more readable if you put
fctx->record[record_id] into a local var.
Done.
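To make the local-variable suggestion concrete, here is a minimal standalone sketch. DemoRec, DemoContext and demo_save_record() are hypothetical stand-ins, not the real BufferCachePagesRec / BufferCachePagesContext code from the patch; the only point is to take the address of fctx->record[record_id] once and write through that pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for BufferCachePagesRec. */
typedef struct
{
	int32_t		bufferid;
	uint32_t	blocknum;
	bool		isvalid;
	int32_t		numa_node_id;
} DemoRec;

/* Hypothetical stand-in for BufferCachePagesContext. */
typedef struct
{
	DemoRec    *record;
} DemoContext;

/*
 * Fill one record. The element is looked up once and kept in a local
 * pointer, so every assignment reads "rec->field" instead of repeating
 * "fctx->record[record_id].field".
 */
static void
demo_save_record(int record_id, DemoContext *fctx,
				 int32_t bufferid, uint32_t blocknum)
{
	DemoRec    *rec;

	assert(fctx != NULL && fctx->record != NULL);
	assert(record_id >= 0);

	rec = &fctx->record[record_id];

	rec->bufferid = bufferid;
	rec->blocknum = blocknum;
	rec->isvalid = true;
	rec->numa_node_id = -1;		/* not known until the NUMA inquiry runs */
}
```

The same pattern would apply to the reading side in get_buffercache_tuple().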
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
+ nulls[0] = false;
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
+ fctx->record[record_id].isvalid == false)
+ {
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
+ nulls[7] = true;
+
+ /* unused for v1.0 callers, but the array is always long enough */
+ nulls[8] = true;
I'd probably just memset the entire nulls array to either true or false,
instead of doing it one-by-one.
Done.
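For illustration, the memset approach could look roughly like the sketch below. It is standalone code, not the patch itself: DEMO_NUM_ELEM and demo_mark_unused_buffer() are hypothetical names, with DEMO_NUM_ELEM merely standing in for NUM_BUFFERCACHE_PAGES_ELEM.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for NUM_BUFFERCACHE_PAGES_ELEM. */
#define DEMO_NUM_ELEM 10

/*
 * Mark every column as NULL with a single memset, then clear the flag for
 * column 0 (bufferid), which is always reported.
 */
static void
demo_mark_unused_buffer(bool *nulls, int natts)
{
	assert(nulls != NULL);
	assert(natts >= 1);

	memset(nulls, true, sizeof(bool) * natts);
	nulls[0] = false;
}
```

This replaces one assignment per column with two statements and keeps working if further columns are appended to the row later.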
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
+ nulls[1] = false;
+ values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
+ nulls[2] = false;
+ values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
+ nulls[3] = false;
+ values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
+ nulls[4] = false;
+ values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
+ nulls[5] = false;
+ values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
+ nulls[6] = false;
+ values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
+ nulls[7] = false;
Seems like it would end up a lot more readable if you put
fctx->record[record_id] into a local variable. Unfortunately that'd probably
be best done in one more commit ahead of the rest of this one...
Done, I've put those refactoring changes into the commit already
dedicated only to the refactor. For the record, Bertrand also asked for
something like this, but I was somehow afraid to touch Tom's code.
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
You could avoid the need for an alternative output file if you instead made
the queries do something like
SELECT NOT pg_numa_available() OR count(*) ...
ITEM REMAINING: Is this for the future or can it stay like that? I
don't have a strong opinion on this, but I've already wasted lots of
cycles discovering that one can have those ".1" alternative expected
result files.
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,42 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new function.
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
I don't think we can just drop a view in the upgrade script. That will fail if
anybody created a view depending on pg_buffercache.
Ugh, fixed, thanks. That must have been some leftover (we later do
CREATE OR REPLACE those anyway).
(Sorry, ran out of time / energy here, i had originally just wanted to comment
on the apt-get thing in the tests)
Thanks! AIO intensifies ... :)
-J.
[1]: /messages/by-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA+s+_5hcTOg@mail.gmail.com
Attachments:
v17-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch (application/octet-stream)
From a920535a78df30661e10936e4750ad4c3bfc1818 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v17 3/4] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA memory node for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 39 ++++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 170 +++++++++++++++++-
.../sql/pg_buffercache_numa.sql | 20 +++
doc/src/sgml/pgbuffercache.sgml | 61 ++++++-
9 files changed, 324 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..720dc84b2c9
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,39 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index cad7429a21b..1ec9ac25d58 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,13 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -46,6 +47,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_node_id;
} BufferCachePagesRec;
@@ -64,10 +66,56 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+/*
+ * Helper routine to map Buffers into addresses that are used by
+ * pg_numa_query_pages().
+ *
+ * When database block size (BLCKSZ) is smaller than the OS page size (4kB),
+ * multiple database buffers will map to the same OS memory page. In this case,
+ * we only need to query the NUMA node for the first memory address of each
+ * unique OS page rather than for every buffer.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
+ Size os_page_size,
+ void **os_page_ptrs)
+{
+ size_t blk2page = (size_t) (buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer IDs start from 1 */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) +
+ (os_page_size * j);
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+}
+
/*
* Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*/
@@ -122,6 +170,9 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "node_id",
+ INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -175,6 +226,8 @@ pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
else
bufRecord->isvalid = false;
+ bufRecord->numa_node_id = -1;
+
UnlockBufHdr(bufHdr, buf_state);
}
@@ -220,6 +273,7 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
* unused for v1.0 callers, but the array is always long enough
*/
values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ values[9] = Int32GetDatum(bufRecord->numa_node_id);
}
/* Build and return the tuple. */
@@ -271,6 +325,120 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
}
}
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(uint64) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have a
+ * bug in our buffers to OS page mapping code here.
+ *
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ pg_buffercache_save_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+ os_page_ptrs);
+ }
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) i * pages_per_blk;
+
+ /*
+ * Set the NUMA node id for this buffer based on the first OS page
+ * it maps to.
+ *
+ * Note: We could check for errors in os_pages_status and report
+ * them. Also, a single DB block might span multiple NUMA nodes if
+ * it crosses OS pages on node boundaries, but we only record the
+ * node of the first page. This is a simplification but should be
+ * sufficient for most analyses.
+ */
+ fctx->record[i].numa_node_id = os_pages_status[blk2page];
+ }
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ firstNumaTouch = false;
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..086e0062a17 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
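The block-to-OS-page arithmetic used by pg_buffercache_numa_pages() in the patch above can be tried in isolation with the sketch below. It mirrors the three expressions from the patch (os_page_count, pages_per_blk and the index of the first OS page of a buffer), but the demo_* function names are hypothetical and nothing here touches PostgreSQL internals or libnuma.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of OS pages backing nbuffers blocks of blcksz bytes each. */
static uint64_t
demo_os_page_count(uint64_t nbuffers, size_t blcksz, size_t os_page_size)
{
	assert(os_page_size > 0);
	return (nbuffers * blcksz) / os_page_size;
}

/*
 * OS pages per database block. This is a fraction when the OS page is
 * larger than the block, e.g. with huge pages.
 */
static float
demo_pages_per_blk(size_t blcksz, size_t os_page_size)
{
	assert(os_page_size > 0);
	return (float) blcksz / (float) os_page_size;
}

/* Index of the first OS page that a 0-based buffer index falls into. */
static size_t
demo_first_os_page(int buffer_id, float pages_per_blk)
{
	assert(buffer_id >= 0);
	return (size_t) (buffer_id * pages_per_blk);
}
```

With 16384 buffers of 8kB and 4kB OS pages this gives 32768 OS pages and 2 pages per block. With 2MB huge pages, 256 consecutive 8kB buffers share one OS page, so only one address per OS page needs to be queried.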
v17-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch (application/octet-stream)
From 2d0e0bf27d5cc55c97b8e660fd22e6ae94db27d0 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v17 1/4] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper. Other platforms can be added later.
This also adds function pg_numa_available() that can be used to check if
the server was linked with NUMA support.
libnuma is unavailable on 32-bit builds: due to the lack of an i386 shared
object, we disable it there (it would not make sense on i386 anyway, as it is
a very memory-limited platform even with PAE).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 41 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 109 ++++++++++++++++
17 files changed, 432 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 30d949c3c46..bc195975c2e 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma for NUMA awareness
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA awareness" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index 25cdfcf65af..064dfee5ad0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5bf6656deca..1f98826d16d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25138,6 +25138,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..5f0486bb335 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index b8da4966297..f509370ee42 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3225,6 +3246,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3881,6 +3903,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..8675e1b5d87 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index cce29a37ac5..8b61d1ed492 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 8b68b16d79d..d532b8c43b9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8506,6 +8506,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 92f0616c400..e67f81da167 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -675,6 +675,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..2fa0bc82a90
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(volatile uint64 *) (ptr)
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index cf7f07644b9..3b26c68fda7 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..076d0bb5904
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,109 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we support only Linux, via libnuma, but in the future
+ * support for other platforms, e.g. Win32 or FreeBSD, might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.39.5
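A side note on pg_numa_touch_mem_if_required() above: the idea is that every OS page has to be read at least once by the inquiring process before the NUMA inquiry can report a node for it. Below is a minimal standalone sketch of that touching step in plain C (no PostgreSQL headers, no libnuma). The name touch_range() is made up for illustration and is not part of the patch; unlike the macro it reads a single byte per page, and it also copes with a start address that is not page aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): read one byte from every OS page
 * overlapped by [base, base + len) so that each page gets faulted in.
 * page_size must be a power of two. Returns the number of pages touched.
 */
size_t
touch_range(const void *base, size_t len, size_t page_size)
{
	const char *p = (const char *) base;
	const char *end = p + len;
	volatile char sink = 0;
	size_t		touched = 0;

	while (p < end)
	{
		/* distance from p to the start of the next page */
		size_t		step = page_size - ((uintptr_t) p & (page_size - 1));

		/* the volatile read keeps the compiler from dropping the access */
		sink = *(const volatile char *) p;
		touched++;

		if (step >= (size_t) (end - p))
			break;
		p += step;
	}

	(void) sink;
	return touched;
}
```

In the patch the equivalent read happens once per backend (guarded by firstNumaTouch), right before the pointers are handed to pg_numa_query_pages().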
Attachment: v17-0004-Add-pg_shmem_numa_allocations-to-show-NUMA-memor.patch (application/octet-stream)
From 2a9af45ab222c73e835b26caa5e6a596c7e9b6f6 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v17 4/4] Add pg_shmem_numa_allocations to show NUMA memory
node for shared memory allocations.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 131 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 273 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..e83711f9578 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 31d269b7ee0..660a9e9832b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..e83f066171a 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,131 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (1kB, 2kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count the pages on each NUMA node for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. shared memory allocated but not
+ * counted via the shmem index 2. as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d532b8c43b9..db69c0231f9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8510,6 +8510,14 @@
proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 954f549555e..d9d62470cdc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 47478969135..6e460e10d61 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index b81694c24f2..f93d4829702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,11 +1921,13 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
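One thing that might be worth double-checking in pg_get_shmem_numa_allocations() is the page arithmetic: ent->allocated_size / os_page_size rounds down, and ent->location is not necessarily page aligned, so an entry can overlap more OS pages than are counted. Below is a minimal standalone sketch (plain C, no PostgreSQL or libnuma) of counting the pages a byte range overlaps and of summing bytes per node from a move_pages(2)-style status array. The helper names pages_spanned() and tally_nodes() are hypothetical and only illustrate the idea; they are not what the patch does today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: number of OS pages overlapped by the byte range
 * [start, start + len). page_size must be a power of two.
 */
size_t
pages_spanned(uintptr_t start, size_t len, size_t page_size)
{
	uintptr_t	first;
	uintptr_t	last;

	if (len == 0)
		return 0;

	/* round the first and the last byte down to their page boundaries */
	first = start & ~((uintptr_t) page_size - 1);
	last = (start + len - 1) & ~((uintptr_t) page_size - 1);

	return (size_t) ((last - first) / page_size) + 1;
}

/*
 * Hypothetical helper: add up bytes per NUMA node from a status array as
 * filled in by a move_pages(2)-style inquiry. Negative entries (an error
 * code, e.g. -2 for a page that was never faulted in) and node ids above
 * max_node are skipped. node_bytes must have max_node + 1 elements.
 * Returns the number of skipped pages.
 */
size_t
tally_nodes(const int *status, size_t npages, size_t page_size,
			uint64_t *node_bytes, int max_node)
{
	size_t		skipped = 0;

	memset(node_bytes, 0, sizeof(uint64_t) * ((size_t) max_node + 1));

	for (size_t i = 0; i < npages; i++)
	{
		if (status[i] >= 0 && status[i] <= max_node)
			node_bytes[status[i]] += page_size;
		else
			skipped++;
	}

	return skipped;
}
```

For example, a 200-byte entry starting at offset 4000 with 4kB pages overlaps two pages, while size / os_page_size (clamped to at least 1) would report one. Whether the view should attribute such shared pages to both neighbouring entries is a separate design question.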
Attachment: v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch (application/octet-stream)
From 9f032a8c43c837a0cce44ef4b2e0dbf4d7331dda Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v17 2/4] This extracts code from contrib/pg_buffercache's
primary function to separate functions.
This commit adds pg_buffercache_init_entries(), pg_buffercache_save_tuple()
and get_buffercache_tuple() that help fill result tuplestores based on the
buffercache contents. This will be used in a follow-up commit that implements
NUMA observability in pg_buffercache.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 299 ++++++++++--------
1 file changed, 162 insertions(+), 137 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..cad7429a21b 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -68,80 +68,177 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Save buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ bufRecord->bufferid = BufferDescriptorGetBuffer(bufHdr);
+ bufRecord->relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ bufRecord->reltablespace = bufHdr->tag.spcOid;
+ bufRecord->reldatabase = bufHdr->tag.dbOid;
+ bufRecord->forknum = BufTagGetForkNum(&bufHdr->tag);
+ bufRecord->blocknum = bufHdr->tag.blockNum;
+ bufRecord->usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ bufRecord->pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ bufRecord->isdirty = true;
+ else
+ bufRecord->isdirty = false;
+
+ /* Note if the buffer is valid, and has storage created */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ bufRecord->isvalid = true;
+ else
+ bufRecord->isvalid = false;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
HeapTuple tuple;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
- if (SRF_IS_FIRSTCALL())
+ values[0] = Int32GetDatum(bufRecord->bufferid);
+ memset(nulls, false, sizeof(nulls));
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (bufRecord->blocknum == InvalidBlockNumber ||
+ bufRecord->isvalid == false)
{
int i;
- funcctx = SRF_FIRSTCALL_INIT();
-
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
-
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ for (i = 1; i < NUM_BUFFERCACHE_PAGES_ELEM; i++)
+ nulls[i] = true;
+ }
+ else
+ {
+ values[1] = ObjectIdGetDatum(bufRecord->relfilenumber);
+ values[2] = ObjectIdGetDatum(bufRecord->reltablespace);
+ values[3] = ObjectIdGetDatum(bufRecord->reldatabase);
+ values[4] = ObjectIdGetDatum(bufRecord->forknum);
+ values[5] = Int64GetDatum((int64) bufRecord->blocknum);
+ values[6] = BoolGetDatum(bufRecord->isdirty);
+ values[7] = Int16GetDatum(bufRecord->usagecount);
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -152,36 +249,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_save_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -191,59 +259,16 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
-
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
-
- /*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
- */
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
- {
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
- }
- else
- {
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
- }
-
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
--
2.39.5
Hi,
On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote:
On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
+Size
+pg_numa_get_pagesize(void)[..]
Should this have a comment or an assertion that it can only be used after
shared memory startup? Because before that huge_pages_status won't be
meaningful?
Added both.
Thanks for the updated version!
+ Assert(IsUnderPostmaster);
I wonder if it would make more sense to add an assertion on huge_pages_status
and HUGE_PAGES_UNKNOWN instead (more or less as it is done in
CreateSharedMemoryAndSemaphores()).
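To make the suggestion concrete, here is a minimal standalone sketch of such an assertion. It is not the patch's code: the enum, the function name and its parameters are hypothetical stand-ins for huge_pages_status and pg_numa_get_pagesize(), and it assumes the function simply picks between the regular and the huge page size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the server's huge page status values. */
typedef enum
{
    DEMO_HUGE_PAGES_UNKNOWN,    /* shared memory not set up yet */
    DEMO_HUGE_PAGES_OFF,
    DEMO_HUGE_PAGES_ON
} DemoHugePagesStatus;

/*
 * Return the page size backing shared memory.  The status is only
 * meaningful after shared memory startup, so reject UNKNOWN up front.
 */
size_t
demo_numa_get_pagesize(DemoHugePagesStatus status,
                       size_t os_page_size, size_t huge_page_size)
{
    assert(status != DEMO_HUGE_PAGES_UNKNOWN);

    return (status == DEMO_HUGE_PAGES_ON) ? huge_page_size : os_page_size;
}
```

Asserting on the status itself checks the actual precondition (the value has been resolved), whereas a process-type check only implies it.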
=== About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch
Once applied, I can see a mention of pg_buffercache_numa_pages() while it
only comes in v17-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch.
I think pg_buffercache_numa_pages() should not be mentioned before it's actually
implemented.
=== 1
+ bufRecord->isvalid == false)
{
int i;
- funcctx = SRF_FIRSTCALL_INIT();
-
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
-
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ for (i = 1; i <= 9; i++)
+ nulls[i] = true;
"i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM
is increased to 10).
In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM).
It could also make sense to remove the loop and use memset() like this:
"
memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
"
instead. It's done that way in some other places (hbafuncs.c for example).
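As a standalone illustration of the memset() variant (a sketch only: DEMO_NUM_ELEM and the function name are hypothetical stand-ins, with 9 matching NUM_BUFFERCACHE_PAGES_ELEM as of v17-0002):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for NUM_BUFFERCACHE_PAGES_ELEM (9 in v17-0002). */
#define DEMO_NUM_ELEM 9

/*
 * Mark every column except the first one (bufferid) as NULL.
 * memset() fills bytes, so this relies on sizeof(bool) == 1.
 */
void
demo_mark_tail_null(bool *nulls)
{
    assert(sizeof(bool) == 1);

    nulls[0] = false;
    memset(&nulls[1], true, (DEMO_NUM_ELEM - 1) * sizeof(bool));
}
```

Because the length is derived from the element count, the call stays correct when the constant grows to 10 in 0003, which is exactly where the hard-coded "i <= 9" bound goes wrong.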
=== 2
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
I think we should not change the "expected_tupledesc->natts" check here until
v17-0003 is applied.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
Hi Bertrand, happy to see you back. Thanks for the review; here's v18
attached (an ideal fit for PG18 ;))
On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote:
On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
+Size
+pg_numa_get_pagesize(void)[..]
Should this have a comment or an assertion that it can only be used after
shared memory startup? Because before that huge_pages_status won't be
meaningful?
Added both.
Thanks for the updated version!
+ Assert(IsUnderPostmaster);
I wonder if it would make more sense to add an assertion on huge_pages_status
and HUGE_PAGES_UNKNOWN instead (more or less as it is done in
CreateSharedMemoryAndSemaphores()).
Ok, let's have both just in case (this status is initialized to _UNKNOWN by
default, so I hope you didn't have in mind using GetConfigOption(), as that
would need guc.h in port?)
=== About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch
[..]
I think pg_buffercache_numa_pages() should not be mentioned before it's actually
implemented.
Right, fixed.
=== 1
[..]
"i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM
is increased to 10).
In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM).
It could also make sense to remove the loop and use memset() like this:
"
memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
"
instead. It's done that way in some other places (hbafuncs.c for example).
Ouch, good catch.
=== 2
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
I think we should not change the "expected_tupledesc->natts" check here until
v17-0003 is applied.
Right, I've moved that into 0003 where it belongs, and 0002 now has no
NUMA references at all. I've thrown 0001+0002 alone onto CI and it
passed too.
-J.
Attachments:
v18-0004-Add-pg_shmem_numa_allocations-to-show-NUMA-memor.patch
From 91032eb66fd0be91f90bf74618378462b79db347 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v18 4/4] Add pg_shmem_numa_allocations to show NUMA memory
node for shared memory allocations.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 131 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 273 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..e83711f9578 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 31d269b7ee0..660a9e9832b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..e83f066171a 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,131 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count number of NUMA nodes used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = i;
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d532b8c43b9..db69c0231f9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8510,6 +8510,14 @@
proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 954f549555e..d9d62470cdc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 47478969135..6e460e10d61 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index b81694c24f2..f93d4829702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,11 +1921,13 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
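For readers following the per-node accounting in pg_get_shmem_numa_allocations() above, the counting step can be sketched outside the server roughly as follows. This is an illustration under assumptions, not the patch's code: the function name and signature are hypothetical, and the status array is assumed to hold one node number per OS page as filled in by a move_pages()-style inquiry, with negative values (such as -2 for pages not yet faulted in) meaning "no node".

```c
#include <stddef.h>
#include <string.h>

/*
 * Sum up bytes per NUMA node for one shared memory entry.
 *
 * pages_status[i] is the node of OS page i, or a negative value if the
 * kernel could not report one.  bytes[] must have max_node + 1 slots.
 */
void
demo_bytes_per_node(const int *pages_status, size_t page_count,
                    size_t os_page_size, int max_node, size_t *bytes)
{
    size_t      i;

    memset(bytes, 0, sizeof(size_t) * ((size_t) max_node + 1));

    for (i = 0; i < page_count; i++)
    {
        int         s = pages_status[i];

        /* Only index the array with a valid node number. */
        if (s >= 0 && s <= max_node)
            bytes[s] += os_page_size;
    }
}
```

Pages with a negative status are skipped, so the per-node sizes of an entry can add up to less than its allocated size.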
v18-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch
From be92bdd40744843488f66d089c875bbc2d11b999 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v18 3/4] Extend pg_buffercache with new view
pg_buffercache_numa to show NUMA memory node for individual buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 39 ++++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 178 +++++++++++++++++-
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 61 +++++-
9 files changed, 328 insertions(+), 8 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..720dc84b2c9
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,39 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index b0741e568d8..acf7f58501e 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,13 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -46,6 +47,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_node_id;
} BufferCachePagesRec;
@@ -64,12 +66,58 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+/*
+ * Helper routine to map buffers to the OS page addresses that are passed to
+ * pg_numa_query_pages().
+ *
+ * When database block size (BLCKSZ) is smaller than the OS page size (e.g. 4kB),
+ * multiple database buffers will map to the same OS memory page. In this case,
+ * we only need to query the NUMA node for the first memory address of each
+ * unique OS page rather than for every buffer.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
+ Size os_page_size,
+ void **os_page_ptrs)
+{
+ size_t blk2page = (size_t) (buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer IDs are 1-based, while buffer_id here is 0-based */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) +
+ (os_page_size * j);
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+}
+
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*/
static BufferCachePagesContext *
pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
@@ -119,9 +167,12 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "node_id",
+ INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -140,7 +191,7 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Save buffer cache information for a single buffer.
*/
@@ -175,11 +226,13 @@ pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
else
bufRecord->isvalid = false;
+ bufRecord->numa_node_id = -1;
+
UnlockBufHdr(bufHdr, buf_state);
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Format and return a tuple for a single buffer cache entry.
*/
@@ -215,6 +268,7 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
* unused for v1.0 callers, but the array is always long enough
*/
values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ values[9] = Int32GetDatum(bufRecord->numa_node_id);
}
/* Build and return the tuple. */
@@ -266,6 +320,120 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
}
}
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * Different database block sizes (1kB, 2kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * Pre-fill the status array with 0xff bytes (-1 per entry). If we ever
+ * see that value after the kernel inquiry, we probably have a bug in
+ * our buffers to OS page mapping code here.
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ pg_buffercache_save_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+ os_page_ptrs);
+ }
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) (i * pages_per_blk);
+
+ /*
+ * Set the NUMA node id for this buffer based on the first OS page
+ * it maps to.
+ *
+ * Note: We could check for errors in os_pages_status and report
+ * them. Also, a single DB block might span multiple NUMA nodes if
+ * it crosses OS pages on node boundaries, but we only record the
+ * node of the first page. This is a simplification but should be
+ * sufficient for most analyses.
+ */
+ fctx->record[i].numa_node_id = os_pages_status[blk2page];
+ }
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ firstNumaTouch = false;
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..086e0062a17 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
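
The buffer-to-OS-page mapping that pg_buffercache_numa_pages() relies on in the patch above can be sketched in isolation. This is only an illustration under stated assumptions, not the patch's code: it uses integer arithmetic instead of the patch's float pages_per_blk, it assumes the buffer pool starts on an OS page boundary, and the helper names (first_os_page_of_buffer, os_pages_for_buffers, os_pages_touched_by_buffer) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: index of the first OS page backing a 0-based buffer,
 * assuming the buffer pool starts on an OS page boundary.
 */
static size_t
first_os_page_of_buffer(size_t buffer_idx, size_t blcksz, size_t os_page_size)
{
	assert(blcksz > 0 && os_page_size > 0);
	return (buffer_idx * blcksz) / os_page_size;
}

/*
 * Hypothetical helper: number of OS pages that must be queried to cover
 * nbuffers buffers. Rounds up so a partially covered last page is counted.
 */
static size_t
os_pages_for_buffers(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
	assert(blcksz > 0 && os_page_size > 0);
	return (nbuffers * blcksz + os_page_size - 1) / os_page_size;
}

/*
 * Hypothetical helper: number of OS pages a single buffer touches. This is
 * 1 when several buffers share one (huge) page, and more than 1 when the
 * block is larger than the OS page.
 */
static size_t
os_pages_touched_by_buffer(size_t buffer_idx, size_t blcksz, size_t os_page_size)
{
	size_t		first;
	size_t		last;

	assert(blcksz > 0 && os_page_size > 0);
	first = (buffer_idx * blcksz) / os_page_size;
	last = (buffer_idx * blcksz + blcksz - 1) / os_page_size;
	return last - first + 1;
}
```

With the numbers from the NOTICE at the top of this thread (4096-byte OS pages and 2 pages per block, so 16384 buffers of 8kB), os_pages_for_buffers(16384, 8192, 4096) gives 32768 pages. With 2MB huge pages, the first 256 such buffers all map to OS page 0, which is why only one address per unique OS page needs to be queried.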
v18-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch (application/octet-stream)
From b295c3136b774da9dde17571484178096e7d06f4 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v18 2/4] This extracts code from contrib/pg_buffercache's
primary function to separate functions.
This commit adds pg_buffercache_init_entries(), pg_buffercache_save_tuple()
and get_buffercache_tuple() that help fill result tuplestores based on the
buffercache contents. This will be used in a follow-up commit that implements
NUMA observability in pg_buffercache.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 296 ++++++++++--------
1 file changed, 158 insertions(+), 138 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..b0741e568d8 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -68,80 +68,172 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages().
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Save buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ bufRecord->bufferid = BufferDescriptorGetBuffer(bufHdr);
+ bufRecord->relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ bufRecord->reltablespace = bufHdr->tag.spcOid;
+ bufRecord->reldatabase = bufHdr->tag.dbOid;
+ bufRecord->forknum = BufTagGetForkNum(&bufHdr->tag);
+ bufRecord->blocknum = bufHdr->tag.blockNum;
+ bufRecord->usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ bufRecord->pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ bufRecord->isdirty = true;
+ else
+ bufRecord->isdirty = false;
+
+ /* Note if the buffer is valid, and has storage created */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ bufRecord->isvalid = true;
+ else
+ bufRecord->isvalid = false;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ values[0] = Int32GetDatum(bufRecord->bufferid);
+ memset(nulls, false, sizeof(nulls));
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (bufRecord->blocknum == InvalidBlockNumber ||
+ bufRecord->isvalid == false)
+ memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
+ else
+ {
+ values[1] = ObjectIdGetDatum(bufRecord->relfilenumber);
+ values[2] = ObjectIdGetDatum(bufRecord->reltablespace);
+ values[3] = ObjectIdGetDatum(bufRecord->reldatabase);
+ values[4] = ObjectIdGetDatum(bufRecord->forknum);
+ values[5] = Int64GetDatum((int64) bufRecord->blocknum);
+ values[6] = BoolGetDatum(bufRecord->isdirty);
+ values[7] = Int16GetDatum(bufRecord->usagecount);
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -152,36 +244,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_save_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -191,59 +254,16 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
-
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
-
- /*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
- */
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
- {
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
- }
- else
- {
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
- }
-
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
--
2.39.5
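
The split introduced by the refactoring patch above (snapshot one record per buffer first, then format one row per call) can be illustrated outside PostgreSQL. The sketch below is a simplified stand-in, not the extension's code: DemoRecord, DEMO_NCOLS and demo_format_row() are hypothetical names, plain long values replace Datum, and a loop is used instead of memset() for the null flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NCOLS 3			/* hypothetical: id plus two payload columns */

/* Hypothetical snapshot of one buffer, filled while the header is locked */
typedef struct DemoRecord
{
	int			id;
	bool		isvalid;
	long		relfilenode;
	long		usagecount;
} DemoRecord;

/*
 * Hypothetical row formatter: the id column is always present, every other
 * column becomes NULL when the record is unused or not valid.
 */
static void
demo_format_row(const DemoRecord *rec, long values[DEMO_NCOLS],
				bool nulls[DEMO_NCOLS])
{
	assert(rec != NULL);

	/* start with all columns non-null and zeroed */
	for (size_t col = 0; col < DEMO_NCOLS; col++)
	{
		values[col] = 0;
		nulls[col] = false;
	}

	values[0] = rec->id;

	if (!rec->isvalid)
	{
		/* everything except the id is reported as NULL */
		for (size_t col = 1; col < DEMO_NCOLS; col++)
			nulls[col] = true;
	}
	else
	{
		values[1] = rec->relfilenode;
		values[2] = rec->usagecount;
	}
}
```

Keeping the formatter free of any scanning logic is what lets a second set-returning function reuse it and only add its own per-record fields.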
v18-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch (application/octet-stream)
From 0350a860da60a4ea9438a43dfb5b32e5465045ab Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v18 1/4] Add optional dependency to libnuma (Linux-only) for
basic NUMA awareness routines and add minimal src/port/pg_numa.c portability
wrapper. Other platforms can be added later.
This also adds function pg_numa_available() that can be used to check if
the server was linked with NUMA support.
libnuma is not available for 32-bit builds (there is no i386 shared object),
so it is disabled there. It would not make much sense on i386 anyway, as it is
a very memory-limited platform even with PAE.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 41 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 110 ++++++++++++++++
17 files changed, 433 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 30d949c3c46..bc195975c2e 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma for NUMA awareness
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA awareness" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index 25cdfcf65af..064dfee5ad0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5bf6656deca..1f98826d16d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25138,6 +25138,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..5f0486bb335 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index b8da4966297..f509370ee42 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3225,6 +3246,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3881,6 +3903,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..8675e1b5d87 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index cce29a37ac5..8b61d1ed492 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 8b68b16d79d..d532b8c43b9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8506,6 +8506,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 92f0616c400..e67f81da167 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -675,6 +675,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..2fa0bc82a90
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index cf7f07644b9..3b26c68fda7 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..443cd85838a
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size= sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.39.5
Hi Jakub,
On Tue, Apr 01, 2025 at 12:56:06PM +0200, Jakub Wartak wrote:
On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
Hi,
Hi Bertrand, happy to see you back, thanks for review and here's v18
attached (an ideal fit for PG18 ;))
Thanks for the new version!
=== About v18-0002
It looks to be in good shape to me. The helper's name might still be debatable
though.
I just have 2 comments:
=== 1
+ if (bufRecord->blocknum == InvalidBlockNumber ||
+ bufRecord->isvalid == false)
It seems to me that this check could now fit in one line.
=== 2
+ {
SRF_RETURN_DONE(funcctx);
+ }
The extra braces are not needed.
=== About v18-0003
=== 3
I think that pg_buffercache--1.5--1.6.sql is not correct. It should contain
only the changes needed when updating from 1.5, meaning that it should "only"
create the new objects (views and functions in our case) that come in v18-0003
and grant the appropriate privileges.
Also it should mention:
"
\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
"
and not:
"
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
"
The already existing pg_buffercache--1.N--1.(N+1).sql are good examples.
=== 4
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) i * pages_per_blk;
+
I think that should be:
int blk2page = (int) (i * pages_per_blk);
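To illustrate the point about the cast, here is a minimal standalone sketch. The two function names are hypothetical and not from the patch, and it assumes i is an int and pages_per_blk a float. In C a cast binds tighter than multiplication, so the first form casts only i and leaves a float product that is truncated implicitly on assignment. Under these assumptions both forms yield the same value, but the parenthesized one states the truncation explicitly.

```c
/* Hypothetical helpers for illustration only; not part of the patch. */

/* Cast applies to i alone; the float product is truncated on assignment. */
static int
cast_operand_only(int i, float pages_per_blk)
{
    int blk2page = (int) i * pages_per_blk;

    return blk2page;
}

/* Cast applies to the whole product; the truncation is explicit. */
static int
cast_whole_product(int i, float pages_per_blk)
{
    int blk2page = (int) (i * pages_per_blk);

    return blk2page;
}
```

For example, with i = 3 both helpers give 6 for pages_per_blk = 2.0 and 1 for pages_per_blk = 0.5.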
=== About v18-0004
=== 5
When running:
select c.name, c.size as num_size, s.size as shmem_size
from (select n.name as name, sum(n.size) as size
      from pg_shmem_numa_allocations n
      group by n.name) c,
     pg_shmem_allocations s
where c.name = s.name;
I can see:
- pg_shmem_numa_allocations reporting the same size many times
- pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size
Do you observe the same?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
I've spent a bit of time reviewing this. In general I haven't found
anything I'd call a bug, but here's a couple comments for v18 ... Most
of this is in separate "review" commits, with a couple exceptions.
1) Please update the commit messages, with proper formatting, etc. I
tried to do that in the attached v19, but please go through that, add
relevant details, update list of reviewers, etc. The subject should not
be overly long, etc.
2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
use just "libnuma support" similar to other libraries.
3) I don't think we need pg_numa.h to have this:
extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
would it be necessary here? It's enough to have a prototype in the .c file.
4) Improved .sgml to have acronym/productname in a couple places.
5) I don't think the comment for pg_buffercache_init_entries() is very
useful. That it's a helper for pg_buffercache_pages() tells me nothing
about how to use it, what the arguments are, etc.
6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
comment too. I mean, there's no info about what the arguments are, which
arguments are input or output, etc. And it only discusses one option
(block page < memory page), but what happens in the other case? The
formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
saying it's wrong).
However, it seems rather suspicious that pages_per_blk is calculated as
float, and pg_buffercache_numa_prepare_ptrs() then does this:
for (size_t j = 0; j < pages_per_blk; j++)
{ ... }
I mean, isn't this vulnerable to rounding errors, which might trigger
some weird behavior? If not, it'd be good to add a comment why this is
fine, it confuses me a lot. I personally would probably prefer doing
just integer arithmetic here.
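As a rough sketch of what the integer-only variant could look like (this is an illustration, not the patch's code): the helper name and signature below are hypothetical, and it assumes the buffers are laid out contiguously from an address aligned to the OS page size. The range of OS pages for a buffer is derived from byte offsets, so no fractional pages_per_blk is needed, and both cases (block larger than the memory page, block smaller than the memory page) go through the same formula.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: compute the first and last OS
 * page index covered by buffer number buf_id using integer arithmetic only.
 *
 * Assumes buffers are contiguous and the first buffer starts on an OS page
 * boundary.
 */
static void
buffer_os_page_range(uint64_t buf_id, uint64_t blcksz, uint64_t os_page_size,
                     uint64_t *first_page, uint64_t *last_page)
{
    uint64_t start_off;
    uint64_t end_off;

    assert(blcksz > 0 && os_page_size > 0);

    start_off = buf_id * blcksz;        /* offset of the buffer's first byte */
    end_off = start_off + blcksz - 1;   /* offset of the buffer's last byte */

    *first_page = start_off / os_page_size;
    *last_page = end_off / os_page_size;
}
```

With 8kB blocks and 4kB pages, buffer 3 covers pages 6 and 7; with 2MB huge pages, buffers 0 through 255 all map to page 0 and buffer 256 starts page 1. A loop from first_page to last_page then has an exact integer trip count.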
7) This para in the docs seems self-contradictory:
<para>
The <function>pg_buffercache_numa_pages()</function> provides the same
information as <function>pg_buffercache_pages()</function> but is slower
because it also provides the <acronym>NUMA</acronym> node ID per shared
buffer entry. The <structname>pg_buffercache_numa</structname> view wraps
the function for convenient use.
</para>
I mean, "provides the same information, but is slower because it
provides different information" is strange. I understand the point, but
maybe rephrase somehow?
8) Why is pg_numa_available() marked as volatile? Should not change in a
running cluster, no?
9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
let's not do that.
<sect2 id="pgbuffercache-pg-buffercache_numa">
10) I think it'd be good to mention/explain why move_pages is used
instead of get_mempolicy - more efficient with batching, etc. This would
be useful both in the commit message and before the move_pages call (and
in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
pointers like this etc.).
11) This could use UINT64_FORMAT, instead of a cast:
elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
     (unsigned long) os_page_count, os_page_size, pages_per_blk);
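A minimal standalone sketch of the cast-free form follows. It assumes os_page_count is a 64-bit unsigned value, which the quoted snippet does not confirm. The helper name is hypothetical, snprintf stands in for elog so that it compiles outside the PostgreSQL tree, and UINT64_FORMAT is defined locally via PRIu64 as a stand-in for the macro PostgreSQL provides.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in so the sketch compiles alone; PostgreSQL defines this itself. */
#ifndef UINT64_FORMAT
#define UINT64_FORMAT "%" PRIu64
#endif

/*
 * Hypothetical helper: format the debug line without casting the page
 * count. The format macro is a string literal, so it is concatenated
 * with the neighbouring literals at compile time.
 */
static int
format_numa_debug(char *buf, size_t buflen, uint64_t os_page_count,
                  size_t os_page_size, double pages_per_blk)
{
    return snprintf(buf, buflen,
                    "NUMA: os_page_count=" UINT64_FORMAT
                    " os_page_size=%zu pages_per_blk=%.2f",
                    os_page_count, os_page_size, pages_per_blk);
}
```

In the backend the same format string would go straight into the elog() call, with os_page_count passed as is.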
regards
--
Tomas Vondra
Attachments:
v19-0001-Add-support-for-basic-NUMA-awareness.patch (text/x-patch; charset=UTF-8)
From 46a7801b1985a81bb8bc35fcfb2cbb74e6ea5545 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 10:19:35 +0100
Subject: [PATCH v19 1/8] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 41 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 110 ++++++++++++++++
17 files changed, 433 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 30d949c3c46..bc195975c2e 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma for NUMA awareness
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA awareness" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index 25cdfcf65af..064dfee5ad0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5bf6656deca..1f98826d16d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25138,6 +25138,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..5f0486bb335 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index b8da4966297..f509370ee42 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3225,6 +3246,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3881,6 +3903,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..8675e1b5d87 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index cce29a37ac5..8b61d1ed492 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 8b68b16d79d..d532b8c43b9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8506,6 +8506,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 92f0616c400..e67f81da167 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -675,6 +675,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..2fa0bc82a90
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index cf7f07644b9..3b26c68fda7 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..443cd85838a
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size= sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.49.0
v19-0002-pgindent.patch
From df45db584eabdf7503eb868ea12f91b7603fd634 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 1 Apr 2025 21:34:42 +0200
Subject: [PATCH v19 2/8] pgindent
---
src/port/pg_numa.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 443cd85838a..22b3f6a1781 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -94,10 +94,11 @@ pg_numa_get_pagesize(void)
Size os_page_size;
#ifdef WIN32
SYSTEM_INFO sysinfo;
+
GetSystemInfo(&sysinfo);
os_page_size = sysinfo.dwPageSize;
#else
- os_page_size= sysconf(_SC_PAGESIZE);
+ os_page_size = sysconf(_SC_PAGESIZE);
#endif
Assert(IsUnderPostmaster);
--
2.49.0
v19-0003-review.patch
From 2c5a837b87bbb17c4694db517f99611561afa1e1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 1 Apr 2025 20:07:05 +0200
Subject: [PATCH v19 3/8] review
---
configure | 2 +-
configure.ac | 2 +-
doc/src/sgml/installation.sgml | 11 ++++++-----
src/include/port/pg_numa.h | 1 -
src/port/pg_numa.c | 4 ++++
5 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/configure b/configure
index bc195975c2e..b36d66aa3eb 100755
--- a/configure
+++ b/configure
@@ -1594,7 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
- --with-libnuma build with libnuma for NUMA awareness
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
diff --git a/configure.ac b/configure.ac
index 064dfee5ad0..fc8b91afeb1 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1057,7 +1057,7 @@ fi
# libnuma
#
AC_MSG_CHECKING([whether to build with libnuma support])
-PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma for NUMA awareness],
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
[AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
AC_MSG_RESULT([$with_libnuma])
AC_SUBST(with_libnuma)
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 5f0486bb335..1a3f9a0c3ac 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1160,8 +1160,9 @@ build-postgresql:
<term><option>--with-libnuma</option></term>
<listitem>
<para>
- Build with libnuma support for basic NUMA support.
- Only supported on platforms for which the libnuma library is implemented.
+ Build with libnuma support for basic <acronym>NUMA</acronym> support.
+ Only supported on platforms for which the <productname>libnuma</productname>
+ library is implemented.
</para>
</listitem>
</varlistentry>
@@ -2659,9 +2660,9 @@ ninja install
<term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
<listitem>
<para>
- Build with libnuma support for basic NUMA support.
- Only supported on platforms for which the libnuma library is implemented.
- The default for this option is auto.
+ Build with libnuma support for basic <acronym>NUMA</acronym> support.
+ Only supported on platforms for which the <productname>libnuma</productname>
+ library is implemented. The default for this option is auto.
</para>
</listitem>
</varlistentry>
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 2fa0bc82a90..314cff94dbc 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -20,7 +20,6 @@ extern PGDLLIMPORT int pg_numa_init(void);
extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
extern PGDLLIMPORT int pg_numa_get_max_node(void);
extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
-extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
#ifdef USE_LIBNUMA
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 22b3f6a1781..8c234c263be 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -36,6 +36,8 @@
#include <numa.h>
#include <numaif.h>
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
/* libnuma requires initialization as per numa(3) on Linux */
int
pg_numa_init(void)
@@ -59,6 +61,8 @@ pg_numa_get_max_node(void)
#else
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
/* Empty wrappers */
int
pg_numa_init(void)
--
2.49.0
v19-0004-pg_buffercache-split-pg_buffercache_pages-into-p.patch
From 3cef598bc1b841db4253429e94b71d3045791bc3 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v19 4/8] pg_buffercache: split pg_buffercache_pages into parts
Split pg_buffercache_pages() into multiple smaller functions, to allow
reuse in future patches. This introduces three new functions:
- pg_buffercache_init_entries
- pg_buffercache_save_tuple
- get_buffercache_tuple
that help add entries into a tuplestore, describing the contents of
the buffercache.
This is preparation for future patches extending pg_buffercache, e.g.
to add NUMA observability.
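The split described above can be sketched outside of PostgreSQL as three small steps: allocate the cross-call state, snapshot one entry, and format one saved entry. This is only an illustration of the shape of the refactoring, not the patch's code: DemoRecord, DemoContext and the demo_* functions are hypothetical names, plain malloc() stands in for the multi-call memory context, and a text line stands in for a heap tuple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-buffer record kept between calls. */
typedef struct DemoRecord
{
    int         bufferid;
    unsigned    blocknum;
    bool        isvalid;
} DemoRecord;

/* Hypothetical stand-in for the cross-call function context. */
typedef struct DemoContext
{
    int         nrecords;
    DemoRecord *records;
} DemoContext;

/* Step 1: allocate the state that has to survive across calls. */
static DemoContext *
demo_init_entries(int nrecords)
{
    DemoContext *ctx = malloc(sizeof(DemoContext));

    if (ctx == NULL)
        return NULL;
    ctx->nrecords = nrecords;
    ctx->records = calloc((size_t) nrecords, sizeof(DemoRecord));
    if (ctx->records == NULL)
    {
        free(ctx);
        return NULL;
    }
    return ctx;
}

/* Step 2: snapshot one entry (the real helper does this under a header lock). */
static void
demo_save_record(DemoContext *ctx, int i, unsigned blocknum, bool isvalid)
{
    ctx->records[i].bufferid = i + 1;   /* ids are 1-based in this sketch */
    ctx->records[i].blocknum = blocknum;
    ctx->records[i].isvalid = isvalid;
}

/* Step 3: format one saved entry; invalid entries expose only their id. */
static void
demo_format_record(const DemoContext *ctx, int i, char *out, size_t outlen)
{
    const DemoRecord *rec = &ctx->records[i];

    if (rec->isvalid)
        snprintf(out, outlen, "%d|%u", rec->bufferid, rec->blocknum);
    else
        snprintf(out, outlen, "%d|NULL", rec->bufferid);
}

/* Release everything allocated by demo_init_entries(). */
static void
demo_free(DemoContext *ctx)
{
    free(ctx->records);
    free(ctx);
}
```

A caller would run step 1 once, step 2 in a loop over all entries on the first call, and step 3 once per returned row, which is what lets a second function reuse steps 1 and 2 and only change the formatting.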
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 296 ++++++++++--------
1 file changed, 158 insertions(+), 138 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..b0741e568d8 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -68,80 +68,172 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages().
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. ee unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Save buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ bufRecord->bufferid = BufferDescriptorGetBuffer(bufHdr);
+ bufRecord->relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ bufRecord->reltablespace = bufHdr->tag.spcOid;
+ bufRecord->reldatabase = bufHdr->tag.dbOid;
+ bufRecord->forknum = BufTagGetForkNum(&bufHdr->tag);
+ bufRecord->blocknum = bufHdr->tag.blockNum;
+ bufRecord->usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ bufRecord->pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ bufRecord->isdirty = true;
+ else
+ bufRecord->isdirty = false;
+
+ /* Note if the buffer is valid, and has storage created */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ bufRecord->isvalid = true;
+ else
+ bufRecord->isvalid = false;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ values[0] = Int32GetDatum(bufRecord->bufferid);
+ memset(nulls, false, NUM_BUFFERCACHE_PAGES_ELEM);
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (bufRecord->blocknum == InvalidBlockNumber ||
+ bufRecord->isvalid == false)
+ memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
+ else
+ {
+ values[1] = ObjectIdGetDatum(bufRecord->relfilenumber);
+ values[2] = ObjectIdGetDatum(bufRecord->reltablespace);
+ values[3] = ObjectIdGetDatum(bufRecord->reldatabase);
+ values[4] = ObjectIdGetDatum(bufRecord->forknum);
+ values[5] = Int64GetDatum((int64) bufRecord->blocknum);
+ values[6] = BoolGetDatum(bufRecord->isdirty);
+ values[7] = Int16GetDatum(bufRecord->usagecount);
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -152,36 +244,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_save_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -191,59 +254,16 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
-
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
-
- /*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
- */
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
- {
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
- }
- else
- {
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
- }
-
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
+ {
SRF_RETURN_DONE(funcctx);
+ }
}
Datum
--
2.49.0
v19-0005-review.patch
From cc3c49b8ca8e6f6a62d6c9e6e1caf3cfdc79fdc6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 1 Apr 2025 20:40:01 +0200
Subject: [PATCH v19 5/8] review
---
contrib/pg_buffercache/pg_buffercache_pages.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index b0741e568d8..d65b3bdf8df 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -70,6 +70,9 @@ PG_FUNCTION_INFO_V1(pg_buffercache_evict);
/*
* Helper routine for pg_buffercache_pages().
+ *
+ * review: maybe describe what the helper does? Also, I guess we don't want to
+ * keep updating this whenever someone else uses the helper, right?
*/
static BufferCachePagesContext *
pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
@@ -88,7 +91,7 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
/*
* To smoothly support upgrades from version 1.0 of this extension
* transparently handle the (non-)existence of the pinning_backends
- * column. ee unfortunately have to get the result type for that... - we
+ * column. We unfortunately have to get the result type for that... - we
* can't use the result type determined by the function definition without
* potentially crashing when somebody uses the old (or even wrong)
* function definition though.
@@ -261,9 +264,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_NEXT(funcctx, result);
}
else
- {
SRF_RETURN_DONE(funcctx);
- }
}
Datum
--
2.49.0
v19-0006-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
From 6fe26ca004ada3a9a6decc90ecfada44b058e94d Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v19 6/8] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
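As a rough standalone illustration of the touch-before-query step (not the patch's code), the sketch below reads one byte from every OS page overlapping a memory range so that each page is faulted in, and classifies a per-page status value. The helper names touch_pages(), status_is_node() and status_is_not_present() are hypothetical, and the page size is passed in by the caller as an assumption. The status convention used (a node number >= 0, or a negated errno such as -ENOENT) is the one documented for move_pages(2).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Read one byte from every OS page overlapping [start, start + len), so
 * that each page gets faulted in. Returns the number of pages touched.
 * page_size is assumed to be a power of two.
 */
static size_t
touch_pages(const char *start, size_t len, size_t page_size)
{
    volatile char sink = 0;
    uintptr_t   mask = (uintptr_t) page_size - 1;
    uintptr_t   addr = (uintptr_t) start;
    uintptr_t   end = addr + len;
    size_t      touched = 0;

    while (addr < end)
    {
        /* The volatile read keeps the compiler from dropping the access. */
        sink = *(volatile const char *) addr;
        touched++;
        /* Jump to the first byte of the next page. */
        addr = (addr & ~mask) + page_size;
    }
    (void) sink;
    return touched;
}

/* A non-negative status is the NUMA node the page resides on. */
static int
status_is_node(int status)
{
    return status >= 0;
}

/* -ENOENT means the page is not present (e.g. never faulted in). */
static int
status_is_not_present(int status)
{
    return status == -ENOENT;
}
```

Only the first byte of each page within the range is read, so the cost is one memory access per OS page rather than per byte.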
The size of a database block and an OS memory page may differ. For example,
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB). It's
enough to query the NUMA node only once per memory page; we don't need
to repeat this for every buffer.
review: What if we get multiple pages per buffer (the default)? Could we
get multiple nodes per buffer?
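The block-to-page arithmetic behind this can be sketched as follows. This is a simplified model, not the patch's code: it assumes the buffer array starts on an OS page boundary, and os_page_count(), buffer_first_page() and buffer_last_page() are hypothetical helper names. With 8KB blocks on 4KB pages every buffer covers two OS pages, which may in principle sit on different nodes; with 2MB huge pages, 256 consecutive 8KB buffers share a single page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of OS pages needed to cover nbuffers blocks of blcksz bytes. */
static uint64_t
os_page_count(uint64_t nbuffers, uint64_t blcksz, uint64_t page_size)
{
    uint64_t    bytes = nbuffers * blcksz;

    return (bytes + page_size - 1) / page_size;
}

/* Index of the OS page holding the first byte of buffer buf_idx. */
static uint64_t
buffer_first_page(uint64_t buf_idx, uint64_t blcksz, uint64_t page_size)
{
    return (buf_idx * blcksz) / page_size;
}

/* Index of the OS page holding the last byte of buffer buf_idx. */
static uint64_t
buffer_last_page(uint64_t buf_idx, uint64_t blcksz, uint64_t page_size)
{
    return ((buf_idx + 1) * blcksz - 1) / page_size;
}
```

For 16384 buffers of 8KB on 4KB pages, os_page_count() gives 32768 pages, i.e. two pages per block. A buffer whose first and last page indexes differ is one for which a single node id cannot describe the whole block.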
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 39 ++++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 180 +++++++++++++++++-
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 61 +++++-
9 files changed, 328 insertions(+), 10 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..720dc84b2c9
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,39 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+ SELECT P.* FROM pg_buffercache_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index d65b3bdf8df..535123aef1c 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,13 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -46,6 +47,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_node_id;
} BufferCachePagesRec;
@@ -64,15 +66,59 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine to map buffers to the OS page addresses that are passed to
+ * pg_numa_query_pages().
+ *
+ * When database block size (BLCKSZ) is smaller than the OS page size (4kB),
+ * multiple database buffers will map to the same OS memory page. In this case,
+ * we only need to query the NUMA node for the first memory address of each
+ * unique OS page rather than for every buffer.
*
- * review: maybe describe what the helper does? Also, I guess we don't want to
- * keep updating this whenever someone else uses the helper, right?
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
+ Size os_page_size,
+ void **os_page_ptrs)
+{
+ size_t blk2page = (size_t) (buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* Buffer IDs are 1-based, while buffer_id is 0-based */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) +
+ (os_page_size * j);
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*/
static BufferCachePagesContext *
pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
@@ -122,9 +168,12 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "node_id",
+ INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -143,7 +192,7 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Save buffer cache information for a single buffer.
*/
@@ -178,11 +227,13 @@ pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
else
bufRecord->isvalid = false;
+ bufRecord->numa_node_id = -1;
+
UnlockBufHdr(bufHdr, buf_state);
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Format and return a tuple for a single buffer cache entry.
*/
@@ -218,6 +269,7 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
* unused for v1.0 callers, but the array is always long enough
*/
values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ values[9] = Int32GetDatum(bufRecord->numa_node_id);
}
/* Build and return the tuple. */
@@ -267,6 +319,120 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * This is almost identical to the above, but also performs a
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_count = 0;
+ float pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (float) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffers to OS page mapping code here.
+ *
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ pg_buffercache_save_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+ os_page_ptrs);
+ }
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) (i * pages_per_blk);
+
+ /*
+ * Set the NUMA node id for this buffer based on the first OS page
+ * it maps to.
+ *
+ * Note: We could check for errors in os_pages_status and report
+ * them. Also, a single DB block might span multiple NUMA nodes if
+ * it crosses OS pages on node boundaries, but we only record the
+ * node of the first page. This is a simplification but should be
+ * sufficient for most analyses.
+ */
+ fctx->record[i].numa_node_id = os_pages_status[blk2page];
+ }
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ firstNumaTouch = false;
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..086e0062a17 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache_numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly, and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.49.0
v19-0007-review.patch (text/x-patch)
From c613d416f11cde0ae01b99d5f2b5f25d041eeb54 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 1 Apr 2025 21:29:44 +0200
Subject: [PATCH v19 7/8] review
---
contrib/pg_buffercache/pg_buffercache_pages.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 535123aef1c..2d8e1f6ee98 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -86,6 +86,10 @@ static bool firstNumaTouch = true;
* In order to get reliable results we also need to touch memory pages, so that
* inquiry about NUMA memory node doesn't return -2 (which indicates
* unmapped/unallocated pages).
+ *
+ * review: It's not very obvious to me what this does, exactly. I mean, what's
+ * the result in os_page_ptrs? What if BLCKSZ < PAGESIZE or BLCKSZ > PAGESIZE?
+ * What's blk2page and blk2pageoff?
*/
static inline void
pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
--
2.49.0
v19-0008-Add-pg_shmem_numa_allocations-to-show-NUMA-node.patch (text/x-patch)
From ba5692104458eba1bc7dbfbb2251614f634d69eb Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v19 8/8] Add pg_shmem_numa_allocations to show NUMA node
... TBD ...
Why not pg_shmem_allocations_numa?
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 131 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 273 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f5a306247e..e83711f9578 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -176,6 +176,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -3746,6 +3751,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 31d269b7ee0..660a9e9832b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..e83f066171a 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,131 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably have
+ * a bug in our shared memory to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count number of NUMA nodes used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d532b8c43b9..db69c0231f9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8510,6 +8510,14 @@
proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 954f549555e..d9d62470cdc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3144,6 +3144,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
has_table_privilege
@@ -3157,6 +3163,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_backend_memory_contexts;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 47478969135..6e460e10d61 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1740,6 +1740,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index b81694c24f2..f93d4829702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1921,11 +1921,13 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.49.0
On Tue, Apr 1, 2025 at 5:13 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi Jakub,
On Tue, Apr 01, 2025 at 12:56:06PM +0200, Jakub Wartak wrote:
On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
Hi Bertrand, happy to see you back, thanks for review and here's v18
attached (an ideal fit for PG18 ;))
Thanks for the new version!
Hi Bertrand,
I'll be attaching v20 when responding to the follow-up review by Tomas
to avoid double-sending this in the list, but review findings from
here are fixed in v20.
=== About v18-0002
It looks in a good shape to me. The helper's name might still be debatable
though.
I just have 2 comments:
=== 1
+ if (bufRecord->blocknum == InvalidBlockNumber ||
+ bufRecord->isvalid == false)
It seems to me that this check could now fit in one line.
OK.
=== 2
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
Extra parentheses are not needed.
This also should be fixed in v20.
=== About v18-0003
=== 3
I think that pg_buffercache--1.5--1.6.sql is not correct. It should contain
only the necessary changes when updating from 1.5. It means that it should "only"
create the new objects (views and functions in our case) that come in v18-0003
and grant appropriate privs.
Also it should mention:
"
\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
"
and not:
"
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
"
The already existing pg_buffercache--1.N--1.(N+1).sql are good examples.
Hm, good find, it might be leftover from first attempts where we were
aiming just for adding column numa_node_id to pg_buffercache_pages()
(rather than adding new view), all hopefully fixed.
=== 4
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) i * pages_per_blk;
+
I think that should be:
int blk2page = (int) (i * pages_per_blk);
OK, but I still fail to grasp why pgindent doesn't fix this stuff on
its own... I believe the original indent would fix this on its own?
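For what it's worth, here is a small standalone sketch of what the extra parentheses change. The function names are made up for illustration and are not part of the patch. A cast binds tighter than multiplication, so without the parentheses only i is cast and the product stays a double until the assignment truncates it implicitly:

```c
#include <assert.h>

/* Hypothetical helper: the cast applies to i alone, the product stays a double. */
double
cast_then_multiply(int i, double pages_per_blk)
{
    return (int) i * pages_per_blk;
}

/* Hypothetical helper: the whole product is truncated toward zero explicitly. */
int
multiply_then_cast(int i, double pages_per_blk)
{
    assert(i >= 0 && pages_per_blk > 0);
    return (int) (i * pages_per_blk);
}
```

When the result is assigned to an int the value ends up the same in both forms; the parenthesized version just makes the truncation explicit and avoids an implicit double-to-int conversion.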
=== About v18-0004
=== 5
When running:
select c.name, c.size as num_size, s.size as shmem_size
from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocations s
where c.name = s.name;
I can see:
- pg_shmem_numa_allocations reporting a lot of times the same size
- pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size
Do you observe the same?
Yes, it is actually by design: pg_shmem_numa_allocations.size is the sum of
the sizes of the OS pages, not the size of the struct, e.g. with "order by 3 desc":
name | num_size | shmem_size
------------------------------------------------+-----------+------------
WaitEventCustomCounterData | 4096 | 8
Archiver Data | 4096 | 8
SerialControlData | 4096 | 16
I was even wondering if it makes sense to output the "shm pointer
address" (or at least the offset) there to see which shm structures are on
the same page, but we already have that in pg_shmem_allocations.
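A rough sketch of why an 8-byte struct shows up as 4096 there, assuming the view simply counts every OS page an allocation touches. The helper name and the offset/length parameters are hypothetical, not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: total bytes of the whole OS pages touched by an
 * allocation that starts at byte offset `off` in the shared memory segment
 * and is `len` bytes long.
 */
size_t
bytes_of_pages_spanned(size_t off, size_t len, size_t os_page_size)
{
    size_t first_page;
    size_t last_page;

    assert(len > 0 && os_page_size > 0);
    first_page = off / os_page_size;
    last_page = (off + len - 1) / os_page_size;
    return (last_page - first_page + 1) * os_page_size;
}
```

With 4kB pages an 8-byte struct at offset 0 yields 4096, and a 16-byte struct that straddles a page boundary yields 8192.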
-J.
On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi,
I've spent a bit of time reviewing this. In general I haven't found
anything I'd call a bug, but here's a couple comments for v18 ... Most
of this is in separate "review" commits, with a couple exceptions.
Hi, thank you very much for help on this, yes I did not anticipate
this patch to organically grow like that...
I've squashed those review findings into v20 and provided answers for
the "review:".
1) Please update the commit messages, with proper formatting, etc. I
tried to do that in the attached v19, but please go through that, add
relevant details, update list of reviewers, etc. The subject should not
be overly long, etc.
Fixed by you.
2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
use just "libnuma support" similar to other libraries.
Fixed by you.
3) I don't think we need pg_numa.h to have this:
extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
would it be necessary here? It's enough to have a prototype in .c file.
Right, probably the result of ENOTENOUGHCOFFEE and copy/paste.
4) Improved .sgml to have acronym/productname in a couple places.
Great.
5) I don't think the comment for pg_buffercache_init_entries() is very
useful. That it's helper for pg_buffercache_pages() tells me nothing
about how to use it, what the arguments are, etc.
I've added an explanation (in 0003 though), so that this is covered.
I've always assumed that 'static' functions don't need that much of
that(?)
6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
comment too. I mean, there's no info about what the arguments are, which
arguments are input or output, etc. And it only discussed one option
(block page < memory page), but what happens in the other case? The
formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
saying it's wrong).
However, it seems rather suspicious that pages_per_blk is calculated as
float, and pg_buffercache_numa_prepare_ptrs() then does this:
for (size_t j = 0; j < pages_per_blk; j++)
{ ... }
I mean, isn't this vulnerable to rounding errors, which might trigger
some weird behavior? If not, it'd be good to add a comment why this is
fine, it confuses me a lot. I personally would probably prefer doing
just integer arithmetic here.
Please bear with me: if you set client_min_messages to debug1, then
pg_buffercache_numa will dump:
a) without HP, DEBUG: NUMA: os_page_count=32768 os_page_size=4096
pages_per_blk=2.00
b) with HP (2M) DEBUG: NUMA: os_page_count=64 os_page_size=2097152
pages_per_blk=0.003906
so we need to be agile to support two cases as you mention (BLCKSZ >
PAGESIZE and BLCKSZ < PAGESIZE). BLCKSZ are 2..32kB and pagesize are
4kB..1GB, thus we can get in that float the following sample values:
BLCKSZ  pagesize  pages_per_blk
2kB     4kB       0.5
2kB     2MB       0.0009765625
2kB     1GB       0.0000019073486328125  (worst case?)
8kB     4kB       2
8kB     2MB       0.00390625  (example from above: x86_64, 2M HP)
8kB     1GB       0.00000762939453
32kB    4kB       8
32kB    2MB       0.015625
32kB    1GB       0.000030517578125
So that loop:
for (size_t j = 0; j < pages_per_blk; j++)
is quite generic and runs in both cases. I've failed to come up with
integer-based math and generic code for this (without special cases,
which seem to be a no-go here?). So, that loop
then will:
a) launch many times to support BLCKSZ > pagesize, that is when single
DB block spans multiple memory pages
b) launch once when BLCKSZ < pagesize (because 0.003906250 > 0 in the
example above)
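To illustrate the two cases, a tiny standalone sketch that only counts how often such a loop body runs for a given ratio. The function name is made up; the loop header is the one from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Count iterations of: for (size_t j = 0; j < pages_per_blk; j++) */
size_t
loop_iterations(double pages_per_blk)
{
    size_t n = 0;

    assert(pages_per_blk > 0);
    /* j is converted to double for the comparison */
    for (size_t j = 0; j < pages_per_blk; j++)
        n++;
    return n;
}
```

For 2.0 this gives 2 iterations, for 0.00390625 it gives 1, so any positive fractional ratio still enters the loop exactly once.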
The loop touches the pages and stores their addresses into os_page_ptrs[] as
input to one big move_pages(2) query. So we basically ask for all memory pages
for NBuffers. Once we get our NUMA information we then use blk2page =
up_to_NBuffers * pages_per_blk to resolve memory pointers back to
Buffers; if there is a problem anywhere, it could be here.
So let's say we have s_b=4TB (it won't work for sure for other reasons,
let's assume we have it), let's also assume we have no huge
pages (pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824
which multiplied by 2 exceeds INT_MAX (integer overflow bug), so I think
that int is not big enough there in pg_buffercache_numa_pages() (it
should be "size_t blk2page" there as in
pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20).
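A minimal sketch of that check, taking the NBuffers figure above at face value and assuming a 32-bit int. The helper name is hypothetical; the product is computed in 64 bits so the check itself cannot overflow:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* True if buffer_id * pages_per_blk would not fit into a signed int. */
bool
blk2page_overflows_int(int64_t buffer_id, int64_t pages_per_blk)
{
    assert(buffer_id >= 0 && pages_per_blk > 0);
    return buffer_id * pages_per_blk > (int64_t) INT_MAX;
}
```

With buffer_id = 1073741824 and 2 pages per block the product is 2147483648, one past INT_MAX, which is why a wider type such as size_t is needed for blk2page.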
Another angle is s_b=4TB RAM with 2MB HP, BLCKSZ=8kB =>
NBuffers=2097152 * 0.003906250 = 8192.0 .
OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with
float f = 2097152 * 0.003906250;
under clang -Weverything I got "implicit conversion increases
floating-point precision: 'float' to 'double'", so either it is:
- we somehow rewrite all of the core arithmetic here to integers?
- or simply go with doubles just to be sure? I went with doubles in
v20; the explanatory comments are not there yet.
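For discussion only, one possible shape of integer-only math that covers both BLCKSZ > pagesize and BLCKSZ < pagesize without special cases. This is a sketch, not what v20 does; the function names are made up, and it assumes buffer ids are 0-based here and that the buffer pool starts on an OS page boundary:

```c
#include <assert.h>
#include <stdint.h>

/* Index of the first OS page backing buffer `buffer_id` (0-based). */
uint64_t
first_os_page(uint64_t buffer_id, uint64_t blcksz, uint64_t os_page_size)
{
    assert(blcksz > 0 && os_page_size > 0);
    return (buffer_id * blcksz) / os_page_size;
}

/* Index of the last OS page backing buffer `buffer_id` (0-based). */
uint64_t
last_os_page(uint64_t buffer_id, uint64_t blcksz, uint64_t os_page_size)
{
    assert(blcksz > 0 && os_page_size > 0);
    return ((buffer_id + 1) * blcksz - 1) / os_page_size;
}
```

With BLCKSZ=8kB and 4kB pages, buffer 3 maps to OS pages 6..7; with 2MB huge pages, buffers 0..255 all map to OS page 0 and buffer 256 is the first one on page 1.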
7) This para in the docs seems self-contradictory:
<para>
The <function>pg_buffercache_numa_pages()</function> provides the same
information
as <function>pg_buffercache_pages()</function> but is slower because
it also
provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
The <structname>pg_buffercache_numa</structname> view wraps the
function for
convenient use.
</para>
I mean, "provides the same information, but is slower because it
provides different information" is strange. I understand the point, but
maybe rephrase somehow?
Oh my... yes, now it looks way better.
8) Why is pg_numa_available() marked as volatile? Should not change in a
running cluster, no?
No it shouldn't, good find, made it 's'table.
9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
let's not do that.
<sect2 id="pgbuffercache-pg-buffercache_numa">
Fixed.
10) I think it'd be good to mention/explain why move_pages is used
instead of get_mempolicy - more efficient with batching, etc. This would
be useful both in the commit message and before the move_pages call
Ok, added in 0001.
(and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
pointers like this etc.).
Added reference to that earlier new comment here too in 0003.
11) This could use UINT64_FORMAT, instead of a cast:
elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu
pages_per_blk=%.2f",
(unsigned long) os_page_count, os_page_size, pages_per_blk);
Done.
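Outside the PostgreSQL tree the closest standard C analogue to UINT64_FORMAT is the PRIu64 macro from <inttypes.h>. A standalone sketch of the same debug line without the cast (the function name is hypothetical, and snprintf stands in for elog):

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Format the debug line into buf; returns what snprintf() returns. */
int
format_numa_debug(char *buf, size_t buflen, uint64_t os_page_count,
                  size_t os_page_size, double pages_per_blk)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen,
                    "NUMA: os_page_count=%" PRIu64
                    " os_page_size=%zu pages_per_blk=%.2f",
                    os_page_count, os_page_size, pages_per_blk);
}
```

The format macro is pasted into the string literal by the compiler, so the specifier always matches the 64-bit type regardless of whether long is 32 or 64 bits wide.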
12) You have also raised "why not pg_shmem_allocations_numa" instead of
"pg_shmem_numa_allocations"
OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
naming things in general), I can change if you want.
13) In the patch: "review: What if we get multiple pages per buffer
(the default). Could we get multiple nodes per buffer?"
OPEN_QUESTION: Today no, but if we modified pg_buffercache_numa to
output multiple rows per single buffer (with "page_no") then we could
get this:
buffer1:..:page0:numanodeID1
buffer1:..:page1:numanodeID2
buffer2:..:page0:numanodeID1
Should we add such functionality?
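To make the idea concrete, a rough sketch of how such rows could be produced from a per-OS-page node array (what the move_pages() status array would hold). The struct, field and function names are hypothetical, and it only covers the case where a buffer spans a whole number of OS pages:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row shape: one row per (buffer, OS page) pair. */
typedef struct
{
    int bufferid;   /* 1-based, as in pg_buffercache */
    int page_no;    /* OS page index within the buffer, 0-based */
    int node_id;    /* NUMA node reported for that OS page */
} NumaPageRow;

/*
 * Fill `out` with one row per OS page of buffer `bufferid`.
 * `page_nodes` holds one node id per OS page of the whole buffer pool.
 * Returns the number of rows written.
 */
size_t
expand_buffer_rows(int bufferid, const int *page_nodes,
                   size_t pages_per_buffer, NumaPageRow *out)
{
    size_t first_page;

    assert(bufferid >= 1 && pages_per_buffer >= 1);
    first_page = (size_t) (bufferid - 1) * pages_per_buffer;

    for (size_t j = 0; j < pages_per_buffer; j++)
    {
        out[j].bufferid = bufferid;
        out[j].page_no = (int) j;
        out[j].node_id = page_nodes[first_page + j];
    }
    return pages_per_buffer;
}
```

With page_nodes = {1, 2, 1, 1} and two pages per buffer, buffer 1 yields (page0, node 1) and (page1, node 2), matching the sample output above.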
-J.
Attachments:
v20-0002-pg_buffercache-split-pg_buffercache_pages-into-p.patch
From 01a117778b98944ffc50a15777fa023188b4fd45 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v20 2/4] pg_buffercache: split pg_buffercache_pages into parts
Split pg_buffercache_pages() into multiple smaller functions, to allow
reuse in future patches. This introduces three new functions:
- pg_buffercache_init_entries
- pg_buffercache_save_tuple
- get_buffercache_tuple
that help adding entries into a tuplestore, describing the contents of
the buffercache.
This is a preparation for future patches extending pg_buffercache, e.g.
to add NUMA observability.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 293 +++++++++---------
1 file changed, 155 insertions(+), 138 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..ced4ec777a1 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -68,80 +68,171 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages().
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Save buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ bufRecord->bufferid = BufferDescriptorGetBuffer(bufHdr);
+ bufRecord->relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ bufRecord->reltablespace = bufHdr->tag.spcOid;
+ bufRecord->reldatabase = bufHdr->tag.dbOid;
+ bufRecord->forknum = BufTagGetForkNum(&bufHdr->tag);
+ bufRecord->blocknum = bufHdr->tag.blockNum;
+ bufRecord->usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ bufRecord->pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ bufRecord->isdirty = true;
+ else
+ bufRecord->isdirty = false;
+
+ /* Note if the buffer is valid, and has storage created */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ bufRecord->isvalid = true;
+ else
+ bufRecord->isvalid = false;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+ values[0] = Int32GetDatum(bufRecord->bufferid);
+ memset(nulls, false, NUM_BUFFERCACHE_PAGES_ELEM);
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (bufRecord->blocknum == InvalidBlockNumber || bufRecord->isvalid == false)
+ memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
+ else
+ {
+ values[1] = ObjectIdGetDatum(bufRecord->relfilenumber);
+ values[2] = ObjectIdGetDatum(bufRecord->reltablespace);
+ values[3] = ObjectIdGetDatum(bufRecord->reldatabase);
+ values[4] = ObjectIdGetDatum(bufRecord->forknum);
+ values[5] = Int64GetDatum((int64) bufRecord->blocknum);
+ values[6] = BoolGetDatum(bufRecord->isdirty);
+ values[7] = Int16GetDatum(bufRecord->usagecount);
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -152,36 +243,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_save_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -191,55 +253,10 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
-
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
-
- /*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
- */
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
- {
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
- }
- else
- {
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
- }
-
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
--
2.39.5
v20-0003-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
From ce5242a248155e1e18e7b3a3abc38a26bc5537d1 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v20 3/4] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB). It's
enough to query the NUMA node only once per memory page, we don't need
to repeat this for every buffer.
Right now we just report NUMA node of the first page when dealing with
multiple pages per single buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 24 +++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 187 +++++++++++++++++-
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 61 +++++-
9 files changed, 322 insertions(+), 8 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..8c1e891eab2
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,24 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index ced4ec777a1..2f2db4c1634 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,13 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -46,6 +47,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_node_id;
} BufferCachePagesRec;
@@ -64,12 +66,67 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine to map Buffers into the addresses that are used by
+ * pg_numa_query_pages(). Please see its comment for an explanation of why we
+ * need to prepare pointers like this.
+ *
+ * When database block size (BLCKSZ) is smaller than the OS page size (4kB),
+ * multiple database buffers will map to the same OS memory page. In this case,
+ * we only need to query the NUMA node for the first memory address of each
+ * unique OS page rather than for every buffer.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ *
+ * review: It's not very obvious to me what this does, exactly. I mean, what's
+ * the result in os_page_ptrs? What if BLCKSZ < PAGESIZE or BLCKSZ > PAGESIZE?
+ * What's blk2page and blk2pageoff?
+ */
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, double pages_per_blk,
+ Size os_page_size,
+ void **os_page_ptrs)
+{
+ size_t blk2page = (size_t) (buffer_id * pages_per_blk);
+
+ for (size_t j = 0; j < pages_per_blk; j++)
+ {
+ size_t blk2pageoff = blk2page + j;
+
+ if (os_page_ptrs[blk2pageoff] == 0)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ /* NBuffers starts from 1 */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) +
+ (os_page_size * j);
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
+
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+}
+
+/*
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Allocates and returns a new user function context based on the SRF context
+ * (requires funcctx to be initialized by SRF_FIRSTCALL_INIT()) and the
+ * standard function call info.
*/
static BufferCachePagesContext *
pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
@@ -119,9 +176,12 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "node_id",
+ INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -140,7 +200,7 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Save buffer cache information for a single buffer.
*/
@@ -175,11 +235,13 @@ pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
else
bufRecord->isvalid = false;
+ bufRecord->numa_node_id = -1;
+
UnlockBufHdr(bufHdr, buf_state);
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Format and return a tuple for a single buffer cache entry.
*/
@@ -214,6 +276,7 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
* unused for v1.0 callers, but the array is always long enough
*/
values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ values[9] = Int32GetDatum(bufRecord->numa_node_id);
}
/* Build and return the tuple. */
@@ -263,6 +326,120 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_count = 0;
+ double pages_per_blk = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
+ pages_per_blk = (double) BLCKSZ / os_page_size;
+
+ elog(DEBUG1, "NUMA: os_page_count=" UINT64_FORMAT " os_page_size=%zu pages_per_blk=%.6f",
+ os_page_count, os_page_size, pages_per_blk);
+
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_pages_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably have
+ * a bug in our buffer to OS page mapping code here.
+ *
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ pg_buffercache_save_tuple(i, fctx);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+ os_page_ptrs);
+ }
+
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ size_t blk2page = (size_t) (i * pages_per_blk);
+
+ /*
+ * Set the NUMA node id for this buffer based on the first OS page
+ * it maps to.
+ *
+ * Note: We could check for errors in os_pages_status and report
+ * them. Also, a single DB block might span multiple NUMA nodes if
+ * it crosses OS pages on node boundaries, but we only record the
+ * node of the first page. This is a simplification but should be
+ * sufficient for most analyses.
+ */
+ fctx->record[i].numa_node_id = os_pages_status[blk2page];
+ }
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ firstNumaTouch = false;
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..315227bf0ce 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Because the <acronym>NUMA</acronym> node ID inquiry requires each memory page
+ to be paged in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly, and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
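
For readers following the block-to-OS-page arithmetic in pg_buffercache_numa_pages(), here is a minimal standalone sketch of the same calculation. The helper names os_page_count_for() and first_os_page_of_buffer() are made up for illustration and do not exist in the patch. The sketch also assumes the buffer pool starts on an OS page boundary, which the patch itself does not guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): number of OS pages needed to
 * back nbuffers blocks of blcksz bytes each.
 */
static uint64_t
os_page_count_for(uint64_t nbuffers, size_t blcksz, size_t os_page_size)
{
    assert(os_page_size > 0);
    return (nbuffers * (uint64_t) blcksz) / os_page_size;
}

/*
 * Hypothetical helper (not in the patch): index of the first OS page that
 * the 0-based buffer_id maps to. pages_per_blk is fractional when one OS
 * page (e.g. a 2MB huge page) holds many database blocks.
 */
static size_t
first_os_page_of_buffer(size_t buffer_id, size_t blcksz, size_t os_page_size)
{
    double pages_per_blk;

    assert(os_page_size > 0);
    pages_per_blk = (double) blcksz / os_page_size;

    return (size_t) (buffer_id * pages_per_blk);
}
```

With 16384 buffers of 8kB and 4kB OS pages this yields 32768 OS pages and two pages per block. With 2MB huge pages, 256 consecutive buffers share one OS page, so buffers 0..255 all map to page index 0 and buffer 256 is the first one on page index 1.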
Attachment: v20-0001-Add-support-for-basic-NUMA-awareness.patch (application/x-patch)
From fe5cc5eaf57b1ddf81d8852da5442fef27960027 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v20 1/4] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with the NUMA library.
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 442 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 3d0e701c745..8308200dce7 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma for NUMA awareness
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA awareness" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index 47a287926bc..ab4a0dc2be7 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2488e9ba998..4bb60e9e080 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..5f0486bb335 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index ba7916d1493..7cd524307b3 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3240,6 +3261,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3896,6 +3918,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..8675e1b5d87 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index cce29a37ac5..8b61d1ed492 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6b57b7e18d9..e6730ac703c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8518,6 +8518,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 2ac61575883..b7144cbf32f 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -676,6 +676,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..314cff94dbc
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux before calling pg_numa_query_pages(), as the
+ * pages must be faulted in before move_pages(2) returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) (ptr)
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index cf7f07644b9..3b26c68fda7 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * Currently we support only Linux, via libnuma, but support for other
+ * platforms (e.g. Win32 or FreeBSD) might be added in the future.
+ * For the Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use the move_pages(2) syscall here, instead of get_mempolicy(2), because
+ * it lets us query many memory pages in a single batched system call, which
+ * is much faster than issuing one call per page.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.39.5
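
Once the status array has been filled in by pg_numa_query_pages(), a per-node summary can be computed in C rather than by grouping rows in SQL. The following is only a sketch: tally_pages_per_node() is a hypothetical helper, not something in this patch set, and it assumes (per move_pages(2)) that each status entry is either a node ID or a negative errno value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): count OS pages per NUMA node.
 *
 * status    - per-page results as returned through the status array
 * count     - number of entries in status
 * max_node  - highest valid node ID (e.g. from pg_numa_get_max_node())
 * per_node  - output array with max_node + 1 entries
 *
 * Returns the number of entries that could not be attributed to a node,
 * i.e. negative (errno-style) values or IDs above max_node.
 */
static size_t
tally_pages_per_node(const int *status, size_t count, int max_node,
                     uint64_t *per_node)
{
    size_t unresolved = 0;

    assert(max_node >= 0);

    for (int n = 0; n <= max_node; n++)
        per_node[n] = 0;

    for (size_t i = 0; i < count; i++)
    {
        if (status[i] >= 0 && status[i] <= max_node)
            per_node[status[i]]++;
        else
            unresolved++;       /* e.g. page not present */
    }

    return unresolved;
}
```

Multiplying each per-node count by the OS page size would give a size in bytes per node. Entries counted as unresolved are the ones the comment in pg_buffercache_numa_pages() mentions could be checked and reported.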
Attachment: v20-0004-Add-new-pg_shmem_numa_allocations-view.patch (application/x-patch)
From 752664b8e05982f054500203eba7c49dd52fdb46 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v20 4/4] Add new pg_shmem_numa_allocations view
Introduce a new pg_shmem_numa_allocations view that shows how shared memory
allocations are distributed across NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 131 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 273 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index e9a59af8c34..c1d63ffc3b4 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4040,6 +4045,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 64a7240aa77..eef7a7f9788 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..e83f066171a 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,131 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ shm_ent_page_count = ent->allocated_size / os_page_size;
+ /* It is always at least 1 page */
+ shm_ent_page_count = shm_ent_page_count == 0 ? 1 : shm_ent_page_count;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count number of NUMA nodes used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = i;
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e6730ac703c..966ae7994f4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8522,6 +8522,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 5588d83e1bf..f66cf1bbfbd 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3150,6 +3150,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3169,6 +3175,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index d9533deb04e..6c5da81a2b2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1756,6 +1756,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 286b1d03756..ca51dfd7702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1922,12 +1922,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
Hi Jakub,
On Wed, Apr 02, 2025 at 04:45:53PM +0200, Jakub Wartak wrote:
On Tue, Apr 1, 2025 at 5:13 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
=== 4
+ for (i = 0; i < NBuffers; i++)
+ {
+ int blk2page = (int) i * pages_per_blk;
+
I think that should be:
int blk2page = (int) (i * pages_per_blk);
OK, but I still fail to grasp why pgindent doesn't fix this stuff on
its own... I believe the original indent would fix this on its own?
My comment was not about indentation but about the fact that I think the
cast is not in the right place. I think it's the result of the multiplication
that we want to cast (the cast operator has higher precedence than the
multiplication operator).
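To make the precedence point concrete, here is a small C sketch (the function names are made up for illustration and are not part of the patch). Because the cast binds tighter than the multiplication, `(int) i * pages_per_blk` casts only `i`. In the quoted line the value still ends up truncated, because it is assigned to an int, so the extra parentheses mainly make the intent explicit. The difference becomes visible once the cast lands on the fractional operand:

```c
#include <assert.h>

/* Form quoted above: only i is cast (a no-op for an int), the product is
 * a double, and the conversion to int happens implicitly on return. */
static int
quoted_form(int i, double pages_per_blk)
{
    assert(i >= 0);
    return (int) i * pages_per_blk;
}

/* Suggested form: the whole product is truncated explicitly. */
static int
cast_product(int i, double pages_per_blk)
{
    assert(i >= 0);
    return (int) (i * pages_per_blk);
}

/* Cast applied to the fractional operand first: 0.5 becomes 0 before the
 * multiplication, so the result differs from the two forms above. */
static int
cast_factor_first(int i, double pages_per_blk)
{
    assert(i >= 0);
    return (int) pages_per_blk * i;
}
```

With i = 3 and pages_per_blk = 0.5 the first two functions agree, while the third one returns 0.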
select c.name, c.size as num_size, s.size as shmem_size
from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocations s
where c.name = s.name;
I can see:
- pg_shmem_numa_allocations reporting the same size many times
- pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size
Do you observe the same?
Yes, it is actually by design: the pg_shmem_allocations.size is the sum of
page sizes, not the size of the struct,
Ok, but then does it make sense to see some num_size < shmem_size?
postgres=# select c.name, c.size as num_size, s.size as shmem_size
from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocations s
where c.name = s.name and s.size > c.size;
name | num_size | shmem_size
---------------+-----------+------------
XLOG Ctl | 4194304 | 4208200
Buffer Blocks | 134217728 | 134221824
AioHandleIOV | 2097152 | 2850816
(3 rows)
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 4/2/25 16:46, Jakub Wartak wrote:
On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi,
I've spent a bit of time reviewing this. In general I haven't found
anything I'd call a bug, but here's a couple comments for v18 ... Most
of this is in separate "review" commits, with a couple exceptions.
Hi, thank you very much for the help on this; yes, I did not anticipate
this patch to organically grow like that...
I've squashed those review findings into v20 and provided answers for
the "review:".
Thanks.
1) Please update the commit messages, with proper formatting, etc. I
tried to do that in the attached v19, but please go through that, add
relevant details, update list of reviewers, etc. The subject should not
be overly long, etc.
Fixed by you.
OK, so you agree the commit messages are complete / correct?
2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
use just "libnuma support" similar to other libraries.
Fixed by you.
OK. FWIW if you disagree with some of my proposed changes, feel free to
push back. I'm sure some may be more a matter of personal preference.
3) I don't think we need pg_numa.h to have this:
extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
would it be necessary here? It's enough to have a prototype in .c file.
Right, probably the result of ENOTENOUGHCOFFEE and copy/paste.
4) Improved .sgml to have acronym/productname in a couple places.
Great.
5) I don't think the comment for pg_buffercache_init_entries() is very
useful. That it's helper for pg_buffercache_pages() tells me nothing
about how to use it, what the arguments are, etc.
I've added an explanation (in 0003 though), so that this is covered.
I've always assumed that 'static' functions don't need that much of
that(?)
I think that's mostly true - the (non-static) functions that are part of
the API for a module need better / more detailed docs. But that doesn't
mean static functions shouldn't have at least basic docs (unless the
function is trivial / obvious, but I don't think that's the case here).
If I happen to work on a pg_buffercache patch in a couple months, I'll
still need to understand what the function does. It won't save me that
I'm working on the same file ...
I'm not saying this needs detailed docs, but "helper for X" adds very
little information - I can easily see where it's called from, right?
6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
comment too. I mean, there's no info about what the arguments are, which
arguments are input or output, etc. And it only discusses one option
(block page < memory page), but what happens in the other case? The
formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
saying it's wrong).
However, it seems rather suspicious that pages_per_blk is calculated as
float, and pg_buffercache_numa_prepare_ptrs() then does this:
for (size_t j = 0; j < pages_per_blk; j++)
{ ... }
I mean, isn't this vulnerable to rounding errors, which might trigger
some weird behavior? If not, it'd be good to add a comment why this is
fine, it confuses me a lot. I personally would probably prefer doing
just integer arithmetic here.
Please bear with me: if you set client_min_messages to debug1, then
pg_buffercache_numa will dump:
a) without HP, DEBUG: NUMA: os_page_count=32768 os_page_size=4096
pages_per_blk=2.00
b) with HP (2M) DEBUG: NUMA: os_page_count=64 os_page_size=2097152
pages_per_blk=0.003906
so we need to be agile to support two cases as you mention (BLCKSZ >
PAGESIZE and BLCKSZ < PAGESIZE). BLCKSZ are 2..32kB and pagesize are
4kB..1GB, thus we can get in that float the following sample values:
BLCKSZ  pagesize  pages_per_blk
2kB     4kB       0.5
2kB     2048kB    0.0009765625
2kB     1GB       0.0000019073486328125 (worst case?)
8kB     4kB       2
8kB     2048kB    0.003906250 (example from above: x86_64, 2M HP)
8kB     1GB       0.00000762939453
32kB    4kB       8
32kB    2048kB    0.0156250
32kB    1GB       0.000030517578125
So that loop:
for (size_t j = 0; j < pages_per_blk; j++)
is quite generic and runs in both cases. I've failed to come up with
integer-based math and generic code for this (without special cases,
which seem to be a no-go here?). So that loop will then:
a) run many times to support BLCKSZ > pagesize, that is when a single
DB block spans multiple memory pages
b) run once when BLCKSZ < pagesize (because 0.003906250 > 0 in the
example above)
Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
confusing and difficult to reason about ...
I agree we don't want special cases for each possible combination of
page sizes (I'm not sure we even know all the combinations). What I was
thinking about is two branches, one for (block >= page) and another for
(block < page). AFAICK both values have to be 2^k, so this would
guarantee we have either (block/page) or (page/block) as integer.
I wonder if you could even just calculate both, and have one loop that
deals with both.
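A minimal sketch of the two-branch idea, assuming both sizes are powers of two and that the buffer array starts on an OS page boundary. The helper names are hypothetical and this is not the code that ended up in the patch; it only illustrates that one of block/page or page/block is always an exact integer:

```c
#include <assert.h>
#include <stdint.h>

static int
is_pow2(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Number of OS pages to query for one buffer: block/page when the block
 * is at least as large as the page, otherwise the single page holding it. */
static uint64_t
os_pages_per_buffer(uint64_t blcksz, uint64_t os_page_size)
{
    assert(is_pow2(blcksz) && is_pow2(os_page_size));
    return blcksz >= os_page_size ? blcksz / os_page_size : 1;
}

/* Index of the first OS page backing buffer number buf (0-based). */
static uint64_t
buffer_first_os_page(uint64_t buf, uint64_t blcksz, uint64_t os_page_size)
{
    assert(is_pow2(blcksz) && is_pow2(os_page_size));
    if (blcksz >= os_page_size)
        return buf * (blcksz / os_page_size);   /* block spans whole pages */
    return buf / (os_page_size / blcksz);       /* many blocks per page */
}
```

With 8kB blocks and 4kB pages, buffer 3 starts at OS page 6; with 2MB huge pages, buffers 0..255 all map to page 0 and buffer 511 maps to page 1.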
Loop touches && stores addresses into os_page_ptrs[] as input to this
one big move_pages(2) query. So we basically ask for all memory pages
for NBuffers. Once we get our NUMA information we then use blk2page =
up_to_NBuffers * pages_per_blk to resolve memory pointers back to
Buffers, if anywhere it could be a problem here.
IMHO this is the information I'd expect in the function comment.
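The first step described above (collecting one address per OS page) could look roughly like the sketch below. fill_page_ptrs() is a hypothetical helper, not the patch's function, and it assumes a contiguous region that starts on an OS page boundary. The resulting array is what would then be handed to move_pages(2) with a NULL nodes argument, so that the kernel only reports the node of each page instead of moving anything:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Store the address of every OS page in [base, base + len) into ptrs[],
 * up to max_ptrs entries. Returns the number of entries written.
 */
static size_t
fill_page_ptrs(char *base, size_t len, size_t os_page_size,
               void **ptrs, size_t max_ptrs)
{
    size_t      n = 0;

    assert(os_page_size > 0);
    for (size_t off = 0; off < len && n < max_ptrs; off += os_page_size)
        ptrs[n++] = base + off;
    return n;
}
```

A partially covered trailing page still gets an entry, which matters when the region length is not a multiple of the page size.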
So let's say we have s_b=4TB (it won't work for sure for other reasons,
let's assume we have it), let's also assume we have no huge
pages (pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824
which multiplied by 2 exceeds INT_MAX (integer overflow bug), so I think
that int is not big enough there in pg_buffercache_numa_pages() (it
should be "size_t blk2page" there as in
pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20).
Another angle is s_b=4TB RAM with 2MB HP, BLKSZ=8kB =>
NBuffers=2097152 * 0.003906250 = 8192.0.
OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with
float f = 2097152 * 0.003906250;
under clang -Weverything I got "implicit conversion increases
floating-point precision: 'float' to 'double'", so either it is:
- we somehow rewrite all of the core arithmetic here to integer?
- or simply go with doubles just to be sure? I went with doubles in
v20, explanatory comments are not there yet.
When I say "integer arithmetic" I don't mean it should use 32-bit ints,
or any other data type. I mean that it works with non-floating point
values. It could be int64, Size or whatever is large enough to not
overflow. I really don't see how changing stuff to double makes this
easier to understand.
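As an illustration of the width concern, the sketch below takes the NBuffers figure quoted above at face value and checks whether the page index still fits in an int. The function names are hypothetical, and the numbers assume a platform where int is 32 bits wide:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Returns 1 if nbuffers * pages_per_blk is representable as an int.
 * The check divides instead of multiplying, so it cannot overflow itself.
 */
static int
fits_in_int(int64_t nbuffers, int64_t pages_per_blk)
{
    assert(nbuffers >= 0 && pages_per_blk > 0);
    return nbuffers <= INT_MAX / pages_per_blk;
}

/* Same product computed in a 64-bit unsigned type, as Size would be on
 * 64-bit platforms. */
static uint64_t
blk2page_wide(uint64_t buffer_index, uint64_t pages_per_blk)
{
    return buffer_index * pages_per_blk;
}
```

For 1073741824 buffers and 2 pages per block the int check fails, while the 64-bit product is simply 2147483648.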
7) This para in the docs seems self-contradictory:
<para>
The <function>pg_buffercache_numa_pages()</function> provides the same
information
as <function>pg_buffercache_pages()</function> but is slower because
it also
provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
The <structname>pg_buffercache_numa</structname> view wraps the
function for
convenient use.
</para>
I mean, "provides the same information, but is slower because it
provides different information" is strange. I understand the point, but
maybe rephrase somehow?
Oh my... yes, now it looks way better.
8) Why is pg_numa_available() marked as volatile? Should not change in a
running cluster, no?
No, it shouldn't. Good find, made it 's'table.
9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
let's not do that.
<sect2 id="pgbuffercache-pg-buffercache_numa">
Fixed.
10) I think it'd be good to mention/explain why move_pages is used
instead of get_mempolicy - more efficient with batching, etc. This would
be useful both in the commit message and before the move_pages call
Ok, added in 0001.
(and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
pointers like this etc.).
Added reference to that earlier new comment here too in 0003.
Will take a look in the evening.
11) This could use UINT64_FORMAT, instead of a cast:
elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu
pages_per_blk=%.2f",
(unsigned long) os_page_count, os_page_size, pages_per_blk);
Done.
12) You have also raised "why not pg_shm_allocations_numa" instead of
"pg_shm_numa_allocations"
OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
naming things in general), I can change if you want.
Me neither. I wonder if there's some precedent when adding similar
variants for other catalogs ... can you check? I've been thinking about
pg_stats and pg_stats_ext, but maybe there's a better example?
13) In the patch: "review: What if we get multiple pages per buffer
(the default). Could we get multiple nodes per buffer?"
OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
output multiple rows per single buffer (with "page_no") then we could
get this:
buffer1:..:page0:numanodeID1
buffer1:..:page1:numanodeID2
buffer2:..:page0:numanodeID1
Should we add such functionality?
When you say "today no" does that mean we know all pages will be on the
same node, or that there may be pages from different nodes and we can't
display that? That'd not be great, IMHO.
I'm not a huge fan of returning multiple rows per buffer, with one row
per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The
rest of the fields is for the whole buffer, it'd be wrong to duplicate
that for each page.
I wonder if we should have a bitmap of nodes for the buffer (but then
what if there are multiple pages from the same node?), or maybe just an
array of nodes, with one element per page.
regards
--
Tomas Vondra
On Wed, Apr 2, 2025 at 5:27 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi Jakub,
Hi Bertrand,
OK, but I still fail to grasp why pgindent doesn't fix this stuff on
its own... I believe the original indent would fix this on its own?
My comment was not about indentation but about the fact that I think the
cast is not in the right place. I think it's the result of the multiplication
that we want to cast (the cast operator has higher precedence than the
multiplication operator).
Oh! I've missed that, but v21 got a rewrite (still not polished) just
to show it can be done without floating point, as Tomas requested.
[..]
Ok, but then does it make sense to see some num_size < shmem_size?
postgres=# select c.name, c.size as num_size, s.size as shmem_size
from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocations s
where c.name = s.name and s.size > c.size;
name | num_size | shmem_size
---------------+-----------+------------
XLOG Ctl | 4194304 | 4208200
Buffer Blocks | 134217728 | 134221824
AioHandleIOV | 2097152 | 2850816
This was a real bug, fixed in v21: ent->allocated_size was not
properly page aligned. Thanks for the attention to detail.
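The reported figures are consistent with 2MB pages and a truncating division of the allocation size (4194304 = 2 x 2097152, and so on); that reading is an assumption, not something stated in the thread. The sketch below contrasts the truncating count with one that aligns the start down and the end up to page boundaries. It uses hypothetical helper names and is not the actual v21 fix:

```c
#include <assert.h>
#include <stdint.h>

/* Page count as obtained by plain integer division of the size. */
static uint64_t
pages_truncated(uint64_t size, uint64_t os_page_size)
{
    assert(os_page_size != 0);
    return size / os_page_size;
}

/*
 * Number of OS pages touched by [start, start + size): align the start
 * down and the end up to a page boundary. os_page_size must be a power
 * of two.
 */
static uint64_t
pages_spanned(uint64_t start, uint64_t size, uint64_t os_page_size)
{
    uint64_t    first;
    uint64_t    end;

    assert(os_page_size != 0 && (os_page_size & (os_page_size - 1)) == 0);
    assert(size > 0);
    first = start & ~(os_page_size - 1);
    end = (start + size + os_page_size - 1) & ~(os_page_size - 1);
    return (end - first) / os_page_size;
}
```

For the "XLOG Ctl" size of 4208200 bytes the truncating count is 2 pages, while a page-aligned allocation of that size touches 3.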
-J.
On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas,
OK, so you agree the commit messages are complete / correct?
Yes.
OK. FWIW if you disagree with some of my proposed changes, feel free to
push back. I'm sure some may be more a matter of personal preference.
No, it's all fine. I will probably have lots of questions about
setting up a proper development environment that takes care of style by
itself, but that's for another day.
[..floats..]
Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
confusing and difficult to reason about ...
I agree we don't want special cases for each possible combination of
page sizes (I'm not sure we even know all the combinations). What I was
thinking about is two branches, one for (block >= page) and another for
(block < page). AFAICK both values have to be 2^k, so this would
guarantee we have either (block/page) or (page/block) as integer.
I wonder if you could even just calculate both, and have one loop that
deals with both.
[..]
When I say "integer arithmetic" I don't mean it should use 32-bit ints,
or any other data type. I mean that it works with non-floating point
values. It could be int64, Size or whatever is large enough to not
overflow. I really don't see how changing stuff to double makes this
easier to understand.
I hear you, attached v21 / 0003 is free of float/double arithmetic
and uses only integer values. It should be more readable too with
those comments. I have not put it into its own function, because now
it fits on one screen, so hopefully one can follow it visually. Please
let me know if that code resolves the doubts or feel free to reformat
it. That _numa_prepare_ptrs() is unused and will need to be removed,
but we can still move some code there if necessary.
12) You have also raised "why not pg_shm_allocations_numa" instead of
"pg_shm_numa_allocations"
OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
naming things in general), I can change if you want.
Me neither. I wonder if there's some precedent when adding similar
variants for other catalogs ... can you check? I've been thinking about
pg_stats and pg_stats_ext, but maybe there's a better example?
Hm, it seems we always go with suffix "_somethingnew":
* pg_stat_database -> pg_stat_database_conflicts
* pg_stat_subscription -> pg_stat_subscription_stats
* even here: pg_buffercache -> pg_buffercache_numa
@Bertrand: do you have anything against pg_shm_allocations_numa
instead of pg_shm_numa_allocations? I don't mind changing it...
13) In the patch: "review: What if we get multiple pages per buffer
(the default). Could we get multiple nodes per buffer?"
OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
output multiple rows per single buffer (with "page_no") then we could
get this:
buffer1:..:page0:numanodeID1
buffer1:..:page1:numanodeID2
buffer2:..:page0:numanodeID1
Should we add such functionality?
When you say "today no" does that mean we know all pages will be on the
same node, or that there may be pages from different nodes and we can't
display that? That'd not be great, IMHO.
I'm not a huge fan of returning multiple rows per buffer, with one row
per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The
rest of the fields is for the whole buffer, it'd be wrong to duplicate
that for each page.
OPEN_QUESTION: With v21 we have all the information available, we are
just unable to display it in pg_buffercache_numa right now. We could
trim the view so that it has 3 columns (and the user needs to JOIN to
pg_buffercache for more details like relationoid), but then what was
the whole refactor (0002) for, if we would just return bufferId like
below:
buffer1:page0:numanodeID1
buffer1:page1:numanodeID2
buffer2:page0:numanodeID1
buffer2:page1:numanodeID1
There's also the problem that reading/joining could be inconsistent
and even slower.
I wonder if we should have a bitmap of nodes for the buffer (but then
what if there are multiple pages from the same node?), or maybe just an
array of nodes, with one element per page.
AFAIR this was discussed back at the end of January, and the
conclusion was more or less - on Discord - that everything sucks
(bitmaps, BIT() datatype, arrays,...) either from the implementation or
the user side, but apparently arrays [] would suck the least from the
implementation side. So we could probably do something like up to
node_max_nodes():
buffer1:..:{0, 2, 0, 0}
buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes
buffer3:..:{0, 0, 0, 2}
Another idea is JSON or even a simple string with numa_node_id<->count:
buffer1:..:"1=2"
buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes
buffer3:..:"3=2"
I find all of those not user-friendly and I'm afraid I won't be able
to pull that off alone in time...
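For reference, the array variant sketched above boils down to counting, per buffer, how many of its OS pages sit on each node. A minimal sketch with a hypothetical helper, assuming page_status[] holds the per-page results of the NUMA inquiry, where negative values mean the page could not be resolved:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill counts[0..max_node] with the number of pages of one buffer that
 * reside on each NUMA node. Entries outside 0..max_node (for example
 * negative error codes) are ignored.
 */
static void
count_pages_per_node(const int *page_status, size_t npages,
                     int max_node, int *counts)
{
    assert(max_node >= 0);
    for (int n = 0; n <= max_node; n++)
        counts[n] = 0;
    for (size_t p = 0; p < npages; p++)
    {
        int         s = page_status[p];

        if (s >= 0 && s <= max_node)
            counts[s]++;
    }
}
```

A buffer whose two pages sit on nodes 1 and 3 yields {0, 1, 0, 1}, matching the edge case shown above.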
-J.
Attachments:
v21-0001-Add-support-for-basic-NUMA-awareness.patch (application/octet-stream)
From fe5cc5eaf57b1ddf81d8852da5442fef27960027 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v21 1/4] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 442 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 3d0e701c745..8308200dce7 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma for NUMA awareness
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA awareness" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index 47a287926bc..ab4a0dc2be7 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2488e9ba998..4bb60e9e080 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..5f0486bb335 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index ba7916d1493..7cd524307b3 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3240,6 +3261,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3896,6 +3918,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..8675e1b5d87 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index cce29a37ac5..8b61d1ed492 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6b57b7e18d9..e6730ac703c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8518,6 +8518,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 2ac61575883..b7144cbf32f 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -676,6 +676,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..314cff94dbc
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index cf7f07644b9..3b26c68fda7 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use move_pages(2) syscall here - instead of get_mempolicy(2) - as the
+ * first one allows us to batch and query about many memory pages in one single
+ * giant system call that is way faster.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.39.5
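How the OS page size returned by pg_numa_get_pagesize() interacts with BLCKSZ is easiest to see with concrete numbers. The following is only a sketch in plain C, assuming a buffer pool that starts at some address pool_start; the PageRange type and the buffer_os_pages and page_range_len helpers are hypothetical and do not appear in the patch. It computes which OS pages (counted from the page containing the start of the pool) a given buffer overlaps.

```c
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of OS page indexes, relative to the pool's first page. */
typedef struct PageRange
{
	size_t		first;
	size_t		last;
} PageRange;

/*
 * Return the OS pages overlapped by buffer number buffer_id.
 *
 * pool_start need not be aligned to os_page_size; page indexes are counted
 * from the OS page that contains pool_start.
 */
static PageRange
buffer_os_pages(uintptr_t pool_start, size_t buffer_id,
				size_t blcksz, size_t os_page_size)
{
	/* Start of the OS page containing the first byte of the pool. */
	uintptr_t	base = pool_start - (pool_start % os_page_size);
	uintptr_t	buf_start = pool_start + buffer_id * blcksz;
	uintptr_t	buf_end = buf_start + blcksz - 1;	/* last byte, inclusive */
	PageRange	r;

	r.first = (buf_start - base) / os_page_size;
	r.last = (buf_end - base) / os_page_size;
	return r;
}

/* Number of OS pages in an inclusive range. */
static size_t
page_range_len(PageRange r)
{
	return r.last - r.first + 1;
}
```

With 8kB blocks and 4kB OS pages on an aligned pool, every buffer covers exactly two OS pages. With 2MB huge pages, 256 consecutive buffers share one OS page. If the pool start is not page-aligned, an 8kB buffer can straddle three 4kB pages, which is the kind of case where one buffer may report more than one NUMA node.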
v21-0002-pg_buffercache-split-pg_buffercache_pages-into-p.patch (application/octet-stream)
From 01a117778b98944ffc50a15777fa023188b4fd45 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v21 2/4] pg_buffercache: split pg_buffercache_pages into parts
Split pg_buffercache_pages() into multiple smaller functions, to allow
reuse in future patches. This introduces three new functions:
- pg_buffercache_init_entries
- pg_buffercache_save_tuple
- get_buffercache_tuple
that help adding entries into a tuplestore, describing the contents of
the buffercache.
This is preparation for future patches extending pg_buffercache, e.g.
to add NUMA observability.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 293 +++++++++---------
1 file changed, 155 insertions(+), 138 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..ced4ec777a1 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -68,80 +68,171 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages().
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Save buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ bufRecord->bufferid = BufferDescriptorGetBuffer(bufHdr);
+ bufRecord->relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ bufRecord->reltablespace = bufHdr->tag.spcOid;
+ bufRecord->reldatabase = bufHdr->tag.dbOid;
+ bufRecord->forknum = BufTagGetForkNum(&bufHdr->tag);
+ bufRecord->blocknum = bufHdr->tag.blockNum;
+ bufRecord->usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ bufRecord->pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ bufRecord->isdirty = true;
+ else
+ bufRecord->isdirty = false;
+
+ /* Note if the buffer is valid, and has storage created */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ bufRecord->isvalid = true;
+ else
+ bufRecord->isvalid = false;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+ values[0] = Int32GetDatum(bufRecord->bufferid);
+ memset(nulls, false, sizeof(nulls));
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (bufRecord->blocknum == InvalidBlockNumber || bufRecord->isvalid == false)
+ memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
+ else
+ {
+ values[1] = ObjectIdGetDatum(bufRecord->relfilenumber);
+ values[2] = ObjectIdGetDatum(bufRecord->reltablespace);
+ values[3] = ObjectIdGetDatum(bufRecord->reldatabase);
+ values[4] = ObjectIdGetDatum(bufRecord->forknum);
+ values[5] = Int64GetDatum((int64) bufRecord->blocknum);
+ values[6] = BoolGetDatum(bufRecord->isdirty);
+ values[7] = Int16GetDatum(bufRecord->usagecount);
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -152,36 +243,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_save_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -191,55 +253,10 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
-
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
-
- /*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
- */
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
- {
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
- }
- else
- {
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
- }
-
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
--
2.39.5
Attachment: v21-0004-Add-new-pg_shmem_numa_allocations-view.patch (application/octet-stream)
From 15b22f287ee40972022d1aaf47cca0b8772b6139 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v21 4/4] Add new pg_shmem_numa_allocations view
Introduce a new pg_shmem_numa_allocations view that shows how shared memory
allocations are distributed across NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 130 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 272 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index e9a59af8c34..c1d63ffc3b4 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4040,6 +4045,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 64a7240aa77..eef7a7f9788 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..bddd1d156c9 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,130 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ /* Get number of OS-aligned pages */
+ shm_ent_page_count = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count the pages on each NUMA node for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e6730ac703c..966ae7994f4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8522,6 +8522,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 5588d83e1bf..f66cf1bbfbd 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3150,6 +3150,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3169,6 +3175,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index d9533deb04e..6c5da81a2b2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1756,6 +1756,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 286b1d03756..ca51dfd7702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1922,12 +1922,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
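For readers skimming the 0004 patch: the per-entry accounting in pg_get_shmem_numa_allocations() boils down to rounding each allocation up to whole OS pages and then tallying the status array returned by the page inquiry. A minimal standalone sketch of that idea follows. The helper names (pages_for_allocation, tally_pages_per_node) are hypothetical and do not exist in the patch; the status values are assumed to follow the move_pages(2) convention of a node number or a negative errno.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: number of OS pages needed to
 * cover an allocation, rounding up to a whole page.
 */
size_t
pages_for_allocation(size_t allocated_size, size_t os_page_size)
{
    assert(os_page_size > 0);
    return (allocated_size + os_page_size - 1) / os_page_size;
}

/*
 * Hypothetical helper, not part of the patch: count how many pages sit on
 * each NUMA node, given a status array filled in by a move_pages(2)-style
 * inquiry.  Negative entries (for example -2 for pages that were never
 * touched) and values above max_node are skipped.  pages_on_node must have
 * room for max_node + 1 counters.  Returns the number of pages counted.
 */
size_t
tally_pages_per_node(const int *status, size_t npages,
                     int max_node, size_t *pages_on_node)
{
    size_t counted = 0;

    assert(max_node >= 0);

    for (int n = 0; n <= max_node; n++)
        pages_on_node[n] = 0;

    for (size_t i = 0; i < npages; i++)
    {
        int s = status[i];

        /* Only index the array with a valid node number */
        if (s >= 0 && s <= max_node)
        {
            pages_on_node[s]++;
            counted++;
        }
    }

    return counted;
}
```

Multiplying each counter by the OS page size gives the per-node byte figure that the view reports in its size column.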
Attachment: v21-0003-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch (application/octet-stream)
From 22f4625d93f4d9cdc8f1994d86ea608ea1d274dc Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v21 3/4] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
XXX: Right now we just report NUMA node of the first page when dealing with
multiple pages per single buffer.
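
The block/page size relation described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: both sizes are powers of two, and the function names (numa_query_count, first_page_index) are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: how many addresses would be
 * passed to the NUMA inquiry for nbuffers buffers.  Assumes blcksz and
 * os_page_size are powers of two.
 */
size_t
numa_query_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
    assert(blcksz > 0 && os_page_size > 0);

    /* Several buffers share one OS page: one (repeated) address per buffer */
    if (os_page_size > blcksz)
        return nbuffers;

    /* Otherwise every buffer covers one or more whole OS pages */
    return nbuffers * (blcksz / os_page_size);
}

/*
 * Hypothetical helper, not part of the patch: index into the status array of
 * the first OS page backing the buffer with 0-based index buffer_idx.
 */
size_t
first_page_index(size_t buffer_idx, size_t blcksz, size_t os_page_size)
{
    assert(blcksz > 0 && os_page_size > 0);

    if (os_page_size > blcksz)
        return buffer_idx;

    return buffer_idx * (blcksz / os_page_size);
}
```

With 16384 buffers of 8kB and 4kB OS pages this yields 32768 addresses, two per buffer; with 2MB huge pages it yields one address per buffer.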
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 24 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 225 +++++++++++++++++-
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 61 ++++-
9 files changed, 360 insertions(+), 8 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..8c1e891eab2
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,24 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index ced4ec777a1..3460cf579f7 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,13 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -46,6 +47,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_node_id;
} BufferCachePagesRec;
@@ -64,12 +66,41 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+/*
+ * Helper routine to map Buffers into addresses that are used by
+ * pg_numa_query_pages(). Please see its comment for explanation why we need to
+ * prepare pointers like this.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ *
+ */
+#if 0
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, double pages_per_blk,
+ Size os_page_size,
+ void **os_page_ptrs)
+{
+
+ /* XXX: move it here? */
+}
+#endif
+
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Allocates and returns new user function context based on SRF context
+ * (requires funcctx to be initialized by SRF_FIRSTCALL_INIT()) and
+ * standard function call info.
*/
static BufferCachePagesContext *
pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
@@ -119,9 +150,12 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "node_id",
+ INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -140,7 +174,7 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Save buffer cache information for a single buffer.
*/
@@ -175,11 +209,13 @@ pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
else
bufRecord->isvalid = false;
+ bufRecord->numa_node_id = -1;
+
UnlockBufHdr(bufHdr, buf_state);
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Format and return a tuple for a single buffer cache entry.
*/
@@ -214,6 +250,7 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
* unused for v1.0 callers, but the array is always long enough
*/
values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ values[9] = Int32GetDatum(bufRecord->numa_node_id);
}
/* Build and return the tuple. */
@@ -263,6 +300,184 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_query_count = 0;
+ int pages_per_buffer = 0;
+ int buffers_per_page = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ buffers_per_page = os_page_size / BLCKSZ;
+ pages_per_buffer = BLCKSZ / os_page_size;
+
+ /*
+ * How many addresses we are going to query (store) depends on the
+ * relation between BLCKSZ : PAGESIZE.
+ */
+ if (buffers_per_page > 1)
+ os_page_query_count = NBuffers;
+ else
+ os_page_query_count = NBuffers * pages_per_buffer;
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_query_count, os_page_size, buffers_per_page, pages_per_buffer);
+
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_query_count);
+ os_pages_status = palloc(sizeof(int) * os_page_query_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ *
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_query_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ int j;
+ volatile uint64 touch pg_attribute_unused();
+
+ pg_buffercache_save_tuple(i, fctx);
+
+ /*
+ * BLCKSZ >= PAGESIZE: If Buffer occupies more than one OS page we
+ * query all OS pages for NUMA information. This won't run for
+ * BLCKSZ < PAGESIZE.
+ */
+ for (j = 0; j < pages_per_buffer; j++)
+ {
+ size_t idx = (size_t) (i * pages_per_buffer) + j;
+
+ /* Buffer IDs start from 1 */
+ os_page_ptrs[idx] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+ }
+
+ /* otherwise BLCKSZ < PAGESIZE: one page hosts many Buffers */
+ if (buffers_per_page > 1)
+ {
+ /*
+ * Although we could query just once per OS page, we do it
+ * repeatedly for each Buffer and hit the same address, as
+ * move_pages(2) requires page alignment. This also
+ * simplifies the retrieval code later on.
+ */
+ os_page_ptrs[i] = (char *) TYPEALIGN_DOWN(os_page_size,
+ (char *) BufferGetBlock(i + 1));
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[i]);
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, os_page_query_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /*
+ * Once we have our NUMA information we resolve memory pointers back
+ * to Buffers
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ size_t idx;
+
+ /*
+ * Note: We could check for errors in os_pages_status and report
+ * them. Again, a single DB block might span multiple NUMA nodes
+ * if it crosses OS pages on node boundaries, but we only record
+ * the node of the first page. This is a simplification but should
+ * be sufficient for most analyses.
+ */
+
+ if (buffers_per_page > 1)
+ idx = i;
+ else
+ {
+ /*
+ * XXX: BLCKSZ >= PAGESIZE: return the node id for this Buffer
+ * based only on >> FIRST << OS page. We could do something
+ * else with this.
+ */
+ idx = i * pages_per_buffer;
+ }
+ fctx->record[i].numa_node_id = os_pages_status[idx];
+ }
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ firstNumaTouch = false;
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..315227bf0ce 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
Hi,
On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas,
OK, so you agree the commit messages are complete / correct?
Yes.
Not 100% sure on my side.
=== v21-0002
Says:
"
This introduces three new functions:
- pg_buffercache_init_entries
- pg_buffercache_build_tuple
- get_buffercache_tuple
"
However, pg_buffercache_build_tuple() is not added; pg_buffercache_save_tuple()
is added instead.
About v21-0002:
=== 1
I can see that the pg_buffercache_init_entries() helper comments are added in
v21-0003 but I think that it would be better to add them in v21-0002
(where the helper is actually created).
About v21-0003:
=== 2
I hear you, attached v21 / 0003 is free of float/double arithmetics
and uses non-float point values.
+ if (buffers_per_page > 1)
+ os_page_query_count = NBuffers;
+ else
+ os_page_query_count = NBuffers * pages_per_buffer;
yeah, that's more elegant. I think that it properly handles the relationships
between buffer and page sizes without relying on float arithmetic.
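To make the integer-only relationship concrete, here is a minimal standalone sketch. It is not code from the patch: the helper names and the power-of-two assertion are assumptions made for illustration, and only the final branch mirrors the quoted hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the patch): true if v is a power of two. */
static int
is_power_of_two(size_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical helper: how many OS page addresses must be queried for
 * nbuffers buffers of blcksz bytes when OS pages are os_page_size bytes.
 */
static size_t
os_page_query_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
    size_t pages_per_buffer;
    size_t buffers_per_page;

    /* Both sizes are assumed to be 2^k, so one always divides the other. */
    assert(is_power_of_two(blcksz) && is_power_of_two(os_page_size));

    pages_per_buffer = blcksz / os_page_size;   /* 0 when block < page */
    buffers_per_page = os_page_size / blcksz;   /* 0 when block > page */

    if (buffers_per_page > 1)
        return nbuffers;    /* one (possibly repeated) address per buffer */

    return nbuffers * pages_per_buffer;
}
```

With 16384 buffers of 8kB and 4kB OS pages this gives 32768 addresses; with 2MB huge pages it gives one address per buffer.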
=== 3
+ if (buffers_per_page > 1)
+ {
As buffers_per_page does not change, I think I'd put this check outside of the
for (i = 0; i < NBuffers; i++) loop, something like:
"
if (buffers_per_page > 1) {
    /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
    for (i = 0; i < NBuffers; i++) {
        ...
    }
} else {
    /* BLCKSZ >= PAGESIZE: Buffer occupies one or more OS pages */
    for (i = 0; i < NBuffers; i++) {
        ...
    }
}
"
=== 4
That _numa_prepare_ptrs() is unused and will need to be removed,
but we can still move some code there if necessary.
Yeah I think that it can be simply removed then.
=== 5
@Bertrand: do you have anything against pg_shm_allocations_numa
instead of pg_shm_numa_allocations? I don't mind changing it...
I think that pg_shm_allocations_numa() is better (given the examples you just
shared).
=== 6
I find all of those non-user friendly and I'm afraid I won't be able
to pull that alone in time...
Maybe we could add a few words in the doc to mention the "multiple nodes per
buffer" case? And try to improve it for say 19? Also maybe we should just focus
till v21-0003 (and discard v21-0004 for 18).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 4/3/25 09:01, Jakub Wartak wrote:
On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas,
OK, so you agree the commit messages are complete / correct?
Yes.
OK. FWIW if you disagree with some of my proposed changes, feel free to
push back. I'm sure some may be more a matter of personal preference.
No, it's all fine. I will probably have lots of questions about
setting proper env for development that cares itself about style, but
that's for another day.
[..floats..]
Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
confusing and difficult to reason about ...
I agree we don't want special cases for each possible combination of
page sizes (I'm not sure we even know all the combinations). What I was
thinking about is two branches, one for (block >= page) and another for
(block < page). AFAICK both values have to be 2^k, so this would
guarantee we have either (block/page) or (page/block) as integer.
I wonder if you could even just calculate both, and have one loop that
deals with both.
[..]
When I say "integer arithmetic" I don't mean it should use 32-bit ints,
or any other data type. I mean that it works with non-floating point
values. It could be int64, Size or whatever is large enough to not
overflow. I really don't see how changing stuff to double makes this
easier to understand.
I hear you, attached v21 / 0003 is free of float/double arithmetics
and uses non-float point values. It should be more readable too with
those comments. I have not put it into its own function, because now
it fits the whole screen, so hopefully one can follow visually. Please
let me know if that code solves the doubts or feel free to reformat
it. That _numa_prepare_ptrs() is unused and will need to be removed,
but we can still move some code there if necessary.
IMHO the code in v21 is much easier to understand. It's not quite clear
to me why it's done outside pg_buffercache_numa_prepare_ptrs(), though.
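For reference, the two-branch idea (block < page versus block >= page) could live in a separate helper along these lines. This is only a sketch under assumptions: the function name and signature are hypothetical, both sizes are assumed to be powers of two, and the block < page branch rounds each address down to the start of its containing OS page, which is a choice made here and not necessarily what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: fill ptrs[] with the OS page addresses to query for
 * nbuffers buffers that start at base and are blcksz bytes each.
 */
static void
fill_os_page_ptrs(char *base, size_t nbuffers, size_t blcksz,
                  size_t os_page_size, void **ptrs)
{
    size_t i;
    size_t j;

    assert(blcksz > 0 && os_page_size > 0);

    if (os_page_size > blcksz)
    {
        /* block < page: one page hosts many buffers, one entry per buffer */
        uintptr_t mask = ~((uintptr_t) os_page_size - 1);

        for (i = 0; i < nbuffers; i++)
        {
            uintptr_t addr = (uintptr_t) (base + i * blcksz);

            /* round down to the start of the containing OS page */
            ptrs[i] = (void *) (addr & mask);
        }
    }
    else
    {
        /* block >= page: one entry per OS page of every buffer */
        size_t pages_per_buffer = blcksz / os_page_size;

        for (i = 0; i < nbuffers; i++)
            for (j = 0; j < pages_per_buffer; j++)
                ptrs[i * pages_per_buffer + j] =
                    base + i * blcksz + j * os_page_size;
    }
}
```

The caller has to size ptrs[] as nbuffers entries in the first case and nbuffers * pages_per_buffer entries in the second.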
12) You have also raised "why not pg_shm_allocations_numa" instead of
"pg_shm_numa_allocations"
OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
naming things in general), I can change if you want.
Me neither. I wonder if there's some precedent when adding similar
variants for other catalogs ... can you check? I've been thinking about
pg_stats and pg_stats_ext, but maybe there's a better example?
Hm, it seems we always go with suffix "_somethingnew":
* pg_stat_database -> pg_stat_database_conflicts
* pg_stat_subscription -> pg_stat_subscription_stats
* even here: pg_buffercache -> pg_buffercache_numa
@Bertrand: do you have anything against pg_shm_allocations_numa
instead of pg_shm_numa_allocations? I don't mind changing it...
+1 to pg_shmem_allocations_numa
13) In the patch: "review: What if we get multiple pages per buffer
(the default). Could we get multiple nodes per buffer?"
OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
output multiple rows per single buffer (with "page_no") then we could
get this:
buffer1:..:page0:numanodeID1
buffer1:..:page1:numanodeID2
buffer2:..:page0:numanodeID1
Should we add such functionality?
When you say "today no" does that mean we know all pages will be on the
same node, or that there may be pages from different nodes and we can't
display that? That'd not be great, IMHO.
I'm not a huge fan of returning multiple rows per buffer, with one row
per page. So for 8K blocks and 4K pages we'd have 2 rows per buffer. The
rest of the fields is for the whole buffer, it'd be wrong to duplicate
that for each page.
OPEN_QUESTION: With v21 we have all the information available, we are
just unable to display this in pg_buffercache_numa right now. We could
trim the view so that it has 3 columns (and user needs to JOIN to
pg_buffercache for more details like relationoid), but then what the
whole refactor (0002) was for if we would just return bufferId like
below:
buffer1:page0:numanodeID1
buffer1:page1:numanodeID2
buffer2:page0:numanodeID1
buffer2:page1:numanodeID1
There's also the problem that reading/joining could be inconsistent
and even slower.
I think a view with just 3 columns would be a good solution. It's what
pg_shmem_allocations_numa already does, so it'd be consistent with that
part too.
I'm not too worried about the cost of the extra join - it's going to be
a couple dozen milliseconds at worst, I guess, and that's negligible in
the bigger scheme of things (e.g. compared to how long the move_pages is
expected to take). Also, it's not like having everything in the same
view is free - people would have to do some sort of post-processing, and
that has a cost too.
So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.
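To illustrate what the 3-column variant would produce, here is a small standalone sketch. The struct and function names are made up for this example; it only shows how a flat move_pages-style status array maps to (buffer, page, node) rows when every buffer spans pages_per_buffer OS pages.

```c
#include <stddef.h>

/* Hypothetical row layout for a 3-column view: buffer, page number, node. */
typedef struct BufferPageNode
{
    int bufferid;   /* 1-based buffer ID, as in pg_buffercache */
    int page_no;    /* OS page index within the buffer, 0-based */
    int node_id;    /* NUMA node reported for that OS page */
} BufferPageNode;

/*
 * Hypothetical helper: expand status[] (one entry per OS page, buffer by
 * buffer) into one row per OS page. Returns the number of rows written.
 */
static size_t
build_page_rows(const int *status, size_t nbuffers,
                size_t pages_per_buffer, BufferPageNode *rows)
{
    size_t n = 0;
    size_t i;
    size_t j;

    for (i = 0; i < nbuffers; i++)
    {
        for (j = 0; j < pages_per_buffer; j++)
        {
            rows[n].bufferid = (int) i + 1;
            rows[n].page_no = (int) j;
            rows[n].node_id = status[i * pages_per_buffer + j];
            n++;
        }
    }

    return n;
}
```

A buffer whose two pages sit on different nodes simply shows up as two rows with different node_id values, so no bitmap or array column is needed.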
I wonder if we should have a bitmap of nodes for the buffer (but then
what if there are multiple pages from the same node?), or maybe just an
array of nodes, with one element per page.
AFAIR this has been discussed back in end of January, and the
conclusion was more or less - on Discord - that everything sucks
(bitmaps, BIT() datatype, arrays,...) either from implementation or
user side, but apparently arrays [] would suck the least from
implementation side. So we could probably do something like up to
node_max_nodes():
buffer1:..:{0, 2, 0, 0}
buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes
buffer3:..:{0, 0, 0, 2}
Other idea is JSON or even simple string with numa_node_id<->count:
buffer1:..:"1=2"
buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes
buffer3:..:"3=2"
I find all of those non-user friendly and I'm afraid I won't be able
to pull that alone in time...
I'm -1 on JSON, I don't see how would that solve anything better than
e.g. a regular array, and it's going to be harder to work with. So if we
don't want to go with the 3-column view proposed earlier, I'd stick to a
simple array. I don't think there's a huge difference between those two
approaches, it should be easy to convert between those approaches using
unnest() and array_agg().
Attached is v22, with some minor review comments:
1) I suggested we should just use "libnuma support" in configure,
instead of talking about "NUMA awareness support", and AFAICS you
agreed. But I still see the old text in configure ... is that
intentional or a bit of forgotten text?
2) I added a couple asserts to pg_buffercache_numa_pages() and comments,
and simplified a couple lines (but that's a matter of preference).
3) I don't think it's correct for pg_get_shmem_numa_allocations to just
silently ignore nodes outside the valid range. I suggest we simply do
elog(ERROR), as it's an internal error we don't expect to happen.
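Roughly what I have in mind for 3), as a standalone sketch and not the actual function: the helper name is hypothetical, and since elog() is not available outside the backend the sketch just returns -1 where the real code would raise the error.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count OS pages per NUMA node from a move_pages-style
 * status array. Returns 0 on success, or -1 as soon as a status value falls
 * outside [0, max_node]; in the backend that would be elog(ERROR, ...).
 */
static int
accumulate_node_pages(const int *status, size_t count, int max_node,
                      size_t *pages_per_node)
{
    size_t i;

    assert(max_node >= 0);

    for (i = 0; i < count; i++)
    {
        int node = status[i];

        /* Do not silently skip unexpected values, report them instead. */
        if (node < 0 || node > max_node)
            return -1;

        pages_per_node[node]++;
    }

    return 0;
}
```

Note that negative status values (error codes for individual pages) are also rejected here; whether those deserve separate handling is a different question.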
regards
--
Tomas Vondra
Attachments:
v22-0001-Add-support-for-basic-NUMA-awareness.patch (text/x-patch; charset=UTF-8)
From 22b6296ee914f8445be5eebf53b994196064d0d3 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v22 1/7] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 442 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 3c19e7e60ec..bea359812d4 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma for NUMA awareness
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA awareness" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index 65db0673f8a..fc8dfa87567 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma for NUMA awareness],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2488e9ba998..4bb60e9e080 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..5f0486bb335 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the libnuma library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index e8b872d29ad..e3f3ab0f335 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3242,6 +3263,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3898,6 +3920,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..8675e1b5d87 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA awareness support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 265fd1b2cfe..92bd85cbed2 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a28a15993a2..63859661951 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8518,6 +8518,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 2ac61575883..b7144cbf32f 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -676,6 +676,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..314cff94dbc
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 51041e75609..228888b2f66 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use the move_pages(2) syscall here, rather than get_mempolicy(2),
+ * because it lets us query many memory pages in a single batched call,
+ * which is much faster than issuing one system call per page.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.49.0
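
Reviewer note on pg_numa_touch_mem_if_required(): the sketch below is only meant to illustrate the "touch before you query" access pattern described in pg_numa.h, outside of PostgreSQL. The function name touch_pages() and its parameters are hypothetical and not part of the patch. It assumes the region starts on an OS page boundary, and it reads a single byte per page (instead of a uint64 as the macro does) so that alignment is not a concern.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: read one byte from the start of every OS page in
 * [base, base + len), so that each page has been accessed at least once
 * before its placement is queried. Returns the number of pages touched.
 *
 * Assumption: base is aligned to page_size.
 */
size_t
touch_pages(const void *base, size_t len, size_t page_size)
{
    const volatile unsigned char *p = base;
    unsigned char sink = 0;
    size_t touched = 0;

    assert(page_size > 0);

    for (size_t off = 0; off < len; off += page_size)
    {
        /* Volatile read: the compiler may not optimize the access away. */
        sink ^= p[off];
        touched++;
    }

    (void) sink;
    return touched;
}
```

The volatile qualifier plays the same role as the ro_volatile_var argument of the macro: it keeps the otherwise unused load in the generated code.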
Attachment: v22-0002-review.patch (text/x-patch; charset=UTF-8)
From 818d26ab70af3f7d1a653fae64c02b3d140ae880 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 11:58:34 +0200
Subject: [PATCH v22 2/7] review
---
configure | 2 +-
doc/src/sgml/installation.sgml | 2 +-
src/include/pg_config.h.in | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/configure b/configure
index bea359812d4..969de1bbeb2 100755
--- a/configure
+++ b/configure
@@ -1594,7 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
- --with-libnuma build with libnuma for NUMA awareness
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 5f0486bb335..ca6fbab065a 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1161,7 +1161,7 @@ build-postgresql:
<listitem>
<para>
Build with libnuma support for basic NUMA support.
- Only supported on platforms for which the libnuma library is implemented.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
</para>
</listitem>
</varlistentry>
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index b7144cbf32f..a31289cd1da 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -676,7 +676,7 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
-/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
#undef USE_LIBNUMA
/* Define to build with io_uring support. (--with-liburing) */
--
2.49.0
Attachment: v22-0003-pg_buffercache-split-pg_buffercache_pages-into-p.patch (text/x-patch; charset=UTF-8)
From 89461994d2fb48dabfe38f3690e33f22e760e3ed Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v22 3/7] pg_buffercache: split pg_buffercache_pages into parts
Split pg_buffercache_pages() into multiple smaller functions, to allow
reuse in future patches. This introduces three new functions:
- pg_buffercache_init_entries
- pg_buffercache_save_tuple
- get_buffercache_tuple
that help with collecting and returning entries describing the contents
of the buffer cache.
This is preparation for future patches extending pg_buffercache, e.g.
to add NUMA observability.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 293 +++++++++---------
1 file changed, 155 insertions(+), 138 deletions(-)
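
Reviewer note: as a rough illustration of the formatting half of this split (the job get_buffercache_tuple() does in the patch), the standalone sketch below turns one previously saved record into values[]/nulls[] arrays. Rec, NCOLS and format_record() are hypothetical names invented for this note, and plain ints stand in for Datums. It is not the patch's code, only the shape of the "always report the id, null out everything else for invalid entries" rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCOLS 4                 /* hypothetical: id plus three payload columns */

typedef struct
{
    int  id;                    /* always reported */
    int  payload[NCOLS - 1];    /* reported only for valid records */
    bool isvalid;
} Rec;

/* Fill values[]/nulls[] for one saved record. */
void
format_record(const Rec *rec, int values[NCOLS], bool nulls[NCOLS])
{
    assert(rec != NULL);

    /* Column 0 (the id) is never null. */
    values[0] = rec->id;
    nulls[0] = false;

    /* Remaining columns are null when the record is not valid. */
    for (size_t c = 1; c < NCOLS; c++)
    {
        nulls[c] = !rec->isvalid;
        values[c] = rec->isvalid ? rec->payload[c - 1] : 0;
    }
}
```

Keeping the snapshot step (taken under the buffer header lock) separate from this formatting step is what lets a second caller reuse both halves.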
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..ced4ec777a1 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -68,80 +68,171 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages().
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Save buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ bufRecord->bufferid = BufferDescriptorGetBuffer(bufHdr);
+ bufRecord->relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ bufRecord->reltablespace = bufHdr->tag.spcOid;
+ bufRecord->reldatabase = bufHdr->tag.dbOid;
+ bufRecord->forknum = BufTagGetForkNum(&bufHdr->tag);
+ bufRecord->blocknum = bufHdr->tag.blockNum;
+ bufRecord->usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ bufRecord->pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ bufRecord->isdirty = true;
+ else
+ bufRecord->isdirty = false;
+
+ /* Note if the buffer is valid, and has storage created */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ bufRecord->isvalid = true;
+ else
+ bufRecord->isvalid = false;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+ values[0] = Int32GetDatum(bufRecord->bufferid);
+ memset(nulls, false, sizeof(nulls));
+
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (bufRecord->blocknum == InvalidBlockNumber || bufRecord->isvalid == false)
+ memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
+ else
+ {
+ values[1] = ObjectIdGetDatum(bufRecord->relfilenumber);
+ values[2] = ObjectIdGetDatum(bufRecord->reltablespace);
+ values[3] = ObjectIdGetDatum(bufRecord->reldatabase);
+ values[4] = ObjectIdGetDatum(bufRecord->forknum);
+ values[5] = Int64GetDatum((int64) bufRecord->blocknum);
+ values[6] = BoolGetDatum(bufRecord->isdirty);
+ values[7] = Int16GetDatum(bufRecord->usagecount);
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -152,36 +243,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_save_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -191,55 +253,10 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
-
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
-
- /*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
- */
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
- {
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
- }
- else
- {
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
- }
-
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
--
2.49.0
Attachment: v22-0004-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch (text/x-patch; charset=UTF-8)
From 76b537112dc94c3077a3058b0ff8361cdda1ec71 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v22 4/7] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
XXX: Right now we just report NUMA node of the first page when dealing with
multiple pages per single buffer.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 24 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 225 +++++++++++++++++-
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 61 ++++-
9 files changed, 360 insertions(+), 8 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
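
Reviewer note: the block size versus OS page size arithmetic described in the commit message can be sketched in isolation as follows. The function names numa_query_count() and numa_first_page_index() are hypothetical, and the sketch assumes both sizes are powers of two, as they are in practice. It mirrors my reading of the patch: one address per buffer when an OS page is larger than a block, otherwise one address per OS page of each buffer, with the first page's status reported for the buffer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of addresses handed to the batched page query.
 *
 * If an OS page is larger than a block (e.g. huge pages), several buffers
 * share one page and we query one address per buffer. Otherwise each buffer
 * covers blcksz / os_page_size pages and we query all of them.
 */
size_t
numa_query_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
    assert(blcksz > 0 && os_page_size > 0);

    if (os_page_size > blcksz)
        return nbuffers;

    return nbuffers * (blcksz / os_page_size);
}

/*
 * Index into the status array of the entry describing the first OS page of
 * the given zero-based buffer.
 */
size_t
numa_first_page_index(size_t buffer, size_t blcksz, size_t os_page_size)
{
    assert(blcksz > 0 && os_page_size > 0);

    if (os_page_size > blcksz)
        return buffer;

    return buffer * (blcksz / os_page_size);
}
```

For example, 16384 buffers of 8kB on 4kB pages give 32768 addresses to query, which matches the os_page_count printed by the first draft earlier in this thread.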
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..8c1e891eab2
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,24 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index ced4ec777a1..3460cf579f7 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,13 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -46,6 +47,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_node_id;
} BufferCachePagesRec;
@@ -64,12 +66,41 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+/*
+ * Helper routine to map Buffers into addresses that are used by
+ * pg_numa_query_pages(). Please see its comment for an explanation of why we
+ * need to prepare pointers like this.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ *
+ */
+#if 0
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, double pages_per_blk,
+ Size os_page_size,
+ void **os_page_ptrs)
+{
+
+ /* XXX: move it here? */
+}
+#endif
+
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Allocates and returns a new user function context based on the SRF context
+ * (requires funcctx to have been initialized by SRF_FIRSTCALL_INIT()) and
+ * the standard function call info.
*/
static BufferCachePagesContext *
pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
@@ -119,9 +150,12 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "node_id",
+ INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -140,7 +174,7 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Save buffer cache information for a single buffer.
*/
@@ -175,11 +209,13 @@ pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
else
bufRecord->isvalid = false;
+ bufRecord->numa_node_id = -1;
+
UnlockBufHdr(bufHdr, buf_state);
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Format and return a tuple for a single buffer cache entry.
*/
@@ -214,6 +250,7 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
* unused for v1.0 callers, but the array is always long enough
*/
values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ values[9] = Int32GetDatum(bufRecord->numa_node_id);
}
/* Build and return the tuple. */
@@ -263,6 +300,184 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_query_count = 0;
+ int pages_per_buffer = 0;
+ int buffers_per_page = 0;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * Different database block sizes (1kB, 2kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: (1) determine the OS
+ * memory page size, (2) calculate how many OS pages are used by all
+ * buffer blocks, and (3) calculate how many OS pages are contained
+ * within each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ buffers_per_page = os_page_size / BLCKSZ;
+ pages_per_buffer = BLCKSZ / os_page_size;
+
+ /*
+ * How many addresses we are going to query (store) depends on the
+ * relation between BLCKSZ : PAGESIZE.
+ */
+ if (buffers_per_page > 1)
+ os_page_query_count = NBuffers;
+ else
+ os_page_query_count = (uint64) NBuffers * pages_per_buffer;
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_query_count, os_page_size, buffers_per_page, pages_per_buffer);
+
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_query_count);
+ os_pages_status = palloc(sizeof(int) * os_page_query_count);
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffer to OS page mapping code here. The array
+ * is prefilled with that pattern so such entries stand out.
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_query_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ int j;
+ volatile uint64 touch pg_attribute_unused();
+
+ pg_buffercache_save_tuple(i, fctx);
+
+ /*
+ * BLCKSZ >= PAGESIZE: If a Buffer occupies more than one OS page,
+ * we query all of its OS pages for NUMA information. This loop
+ * won't run for BLCKSZ < PAGESIZE.
+ */
+ for (j = 0; j < pages_per_buffer; j++)
+ {
+ size_t idx = ((size_t) i * pages_per_buffer) + j;
+
+ /* Buffer numbers are 1-based, hence i + 1 */
+ os_page_ptrs[idx] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+ }
+
+ /* otherwise BLCKSZ < PAGESIZE: one page hosts many Buffers */
+ if (buffers_per_page > 1)
+ {
+ /*
+ * Although we could query just once per OS page, we do it
+ * repeatedly for each Buffer and hit the same address, as
+ * move_pages(2) requires page alignment. This also
+ * simplifies the retrieval code later on.
+ */
+ os_page_ptrs[i] = (char *) TYPEALIGN_DOWN(os_page_size,
+ (char *) BufferGetBlock(i + 1));
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[i]);
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, os_page_query_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /*
+ * Once we have our NUMA information we resolve memory pointers back
+ * to Buffers
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ size_t idx;
+
+ /*
+ * Note: We could check for errors in os_pages_status and report
+ * them. Again, a single DB block might span multiple NUMA nodes
+ * if it crosses OS pages on node boundaries, but we only record
+ * the node of the first page. This is a simplification but should
+ * be sufficient for most analyses.
+ */
+
+ if (buffers_per_page > 1)
+ idx = i;
+ else
+ {
+ /*
+ * XXX: BLCKSZ >= PAGESIZE: return the node id for this Buffer
+ * based only on >> FIRST << OS page. We could do something
+ * else with this.
+ */
+ idx = i * pages_per_buffer;
+ }
+ fctx->record[i].numa_node_id = os_pages_status[idx];
+ }
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ firstNumaTouch = false;
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..315227bf0ce 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.49.0
Attachment: v22-0005-review.patch (text/x-patch; charset=UTF-8)
From 2df0a06b206dedfebd6f4ef7f00eed15edbdee53 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 12:43:21 +0200
Subject: [PATCH v22 5/7] review
---
contrib/pg_buffercache/pg_buffercache_pages.c | 31 +++++++++++++------
1 file changed, 22 insertions(+), 9 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3460cf579f7..dc200204478 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -90,7 +90,7 @@ pg_buffercache_numa_prepare_ptrs(int buffer_id, double pages_per_blk,
Size os_page_size,
void **os_page_ptrs)
{
-
+ /* ??? */
/* XXX: move it here? */
}
#endif
@@ -343,14 +343,28 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
buffers_per_page = os_page_size / BLCKSZ;
pages_per_buffer = BLCKSZ / os_page_size;
+ /*
+ * The page and block sizes are expected to be powers of two, so one
+ * divides the other (we don't know in which direction).
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * Either both counts are 1 (when the page and block sizes are equal),
+ * or exactly one of them is zero. Both can't be zero at the same time.
+ */
+ Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
+ Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
+ ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+
/*
* How many addresses we are going to query (store) depends on the
- * relation between BLCKSZ : PAGESIZE.
+ * relation between BLCKSZ : PAGESIZE. We need at least one status
+ * per buffer - if the memory page is larger than buffer, we still
+ * query it for each buffer. With multiple memory pages per buffer,
+ * we need that many entries.
*/
- if (buffers_per_page > 1)
- os_page_query_count = NBuffers;
- else
- os_page_query_count = NBuffers * pages_per_buffer;
+ os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);
elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
NBuffers, os_page_query_count, os_page_size, buffers_per_page, pages_per_buffer);
@@ -361,7 +375,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/*
* If we ever get 0xff back from kernel inquiry, then we probably have
* bug in our buffers to OS page mapping code here.
- *
*/
memset(os_pages_status, 0xff, sizeof(int) * os_page_query_count);
@@ -410,8 +423,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/*
* Although we could query just once per each OS page, we do it
* repeatedly for each Buffer and hit the same address as
- * move_pages(2) requires page alignment. This is also
- * simplifies retrieval code later on.
+ * move_pages(2) requires page alignment. This also simplifies
+ * retrieval code later on.
*/
os_page_ptrs[i] = (char *) TYPEALIGN(os_page_size,
(char *) BufferGetBlock(i + 1));
--
2.49.0
Attachment: v22-0006-Add-new-pg_shmem_numa_allocations-view.patch (text/x-patch; charset=UTF-8)
From 58e17af7c48fd6eeafcff9523ecdacbd53e90ede Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v22 6/7] Add new pg_shmem_numa_allocations view
Introduce new pg_shmem_numa_allocations view that allows viewing how shared
memory allocations are split across NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 130 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 272 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..6bb5c8a5669 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-numa-allocations">
+ <title><structname>pg_shmem_numa_allocations</structname></title>
+
+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_numa_allocations</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..52ab03a37be 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_numa_allocations AS
+ SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index e453f856794..36d89a58783 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -90,6 +91,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -570,3 +573,130 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ /* Get number of OS aligned pages */
+ shm_ent_page_count = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count number of NUMA nodes used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 63859661951..72efe8df667 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8522,6 +8522,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_numa_allocations' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 5588d83e1bf..f66cf1bbfbd 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3150,6 +3150,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3169,6 +3175,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..8b5862cb11a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 286b1d03756..ca51dfd7702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1922,12 +1922,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.49.0
Attachment: v22-0007-review.patch (text/x-patch; charset=UTF-8)
From 13d7cd9087f64b0393f51b525d5267bbe57ce837 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 12:50:18 +0200
Subject: [PATCH v22 7/7] review
---
src/backend/storage/ipc/shmem.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 36d89a58783..f711e7411db 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -673,7 +673,11 @@ pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
{
int s = pages_status[i];
- /* Ensure we are adding only valid index to the array */
+ /* Ensure we are adding only valid index to the array
+ *
+ * XXX I think we should just error-out if this is untrue, so that
+ * we don't silently hide issues.
+ */
if (s >= 0 && s <= max_nodes)
nodes[s]++;
}
--
2.49.0
On 4/3/25 10:23, Bertrand Drouvot wrote:
Hi,
On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas,
OK, so you agree the commit messages are complete / correct?
Yes.
Not 100% sure on my side.
=== v21-0002
Says:
"
This introduces three new functions:
- pg_buffercache_init_entries
- pg_buffercache_build_tuple
- get_buffercache_tuple
"
While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple()
is).
Ah, OK. Jakub, can you correct (and double-check) this in the next
version of the patch?
About v21-0002:
=== 1
I can see that the pg_buffercache_init_entries() helper comments are added in
v21-0003 but I think that it would be better to add them in v21-0002
(where the helper is actually created).
+1 to that
About v21-0003:
=== 2
I hear you, attached v21 / 0003 is free of float/double arithmetic
and uses non-floating-point values.
+ if (buffers_per_page > 1)
+     os_page_query_count = NBuffers;
+ else
+     os_page_query_count = NBuffers * pages_per_buffer;
yeah, that's more elegant. I think that it properly handles the relationships
between buffer and page sizes without relying on float arithmetic.
In the review I just submitted, I changed this to
os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);
but maybe it's less clear. Feel free to undo my change.
=== 3
+ if (buffers_per_page > 1)
+ {
As buffers_per_page does not change, I think I'd put this check outside of the
for (i = 0; i < NBuffers; i++) loop, something like:
"
if (buffers_per_page > 1) {
/* BLCKSZ < PAGESIZE: one page hosts many Buffers */
for (i = 0; i < NBuffers; i++) {
.
.
.
.
} else {
/* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
for (i = 0; i < NBuffers; i++) {
.
.
I don't know. It's a matter of opinion, but I find the current code
fairly understandable. Maybe if it meaningfully reduced the code
nesting, but even with the extra branch we'd still need the for loop.
I'm not against doing this differently, but I'd have to see how that
looks. Until then I think it's fine to have the code as is.
.
"
=== 4
That _numa_prepare_ptrs() is unused and will need to be removed,
but we can still move some code there if necessary.
Yeah I think that it can be simply removed then.
I'm not particularly attached to having the _ptrs() function, but why
couldn't it build the os_page_ptrs array as before?
=== 5
@Bertrand: do you have anything against pg_shm_allocations_numa
instead of pg_shm_numa_allocations? I don't mind changing it...
I think that pg_shm_allocations_numa() is better (given the examples you just
shared).
=== 6
I find all of those non-user friendly and I'm afraid I won't be able
to pull that alone in time...
Maybe we could add a few words in the doc to mention the "multiple nodes per
buffer" case? And try to improve it for say 19? Also maybe we should just focus
till v21-0003 (and discard v21-0004 for 18).
IMHO it's not enough to paper this over by mentioning it in the docs.
I'm a bit confused about which patch you suggest to leave out. I think
0004 is pg_shmem_allocations_numa, but I think that part is fine, no?
We've been discussing how to represent nodes for buffers in 0003.
I understand it'd require a bit of coding, but AFAICS adding an array
would be a fairly trivial amount of code. Something like
values[i++] = PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID));
would do, I think. But if we decide to do the 3-column view, that'd be
even simpler, I think. AFAICS it means we could mostly ditch the
pg_buffercache refactoring in patch 0002, and 0003 would get much
simpler too I think.
If Jakub doesn't have time to work on this, I can take a stab at it
later today.
regards
--
Tomas Vondra
On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
Hi Bertrand,
On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
[..]
=== v21-0002
While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple()
is).
Fixed
About v21-0002:
=== 1
I can see that the pg_buffercache_init_entries() helper comments are added in
v21-0003 but I think that it would be better to add them in v21-0002
(where the helper is actually created).
Moved
About v21-0003:
=== 2
I hear you, attached v21 / 0003 is free of float/double arithmetic
and uses non-floating-point values.
+ if (buffers_per_page > 1)
+     os_page_query_count = NBuffers;
+ else
+     os_page_query_count = NBuffers * pages_per_buffer;
yeah, that's more elegant. I think that it properly handles the relationships
between buffer and page sizes without relying on float arithmetic.
Cool
=== 3
+ if (buffers_per_page > 1)
+ {
As buffers_per_page does not change, I think I'd put this check outside of the
for (i = 0; i < NBuffers; i++) loop, something like:
"
if (buffers_per_page > 1) {
/* BLCKSZ < PAGESIZE: one page hosts many Buffers */
for (i = 0; i < NBuffers; i++) {
.
.
.
.
} else {
/* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
for (i = 0; i < NBuffers; i++) {
.
.
.
"
Done.
=== 4
That _numa_prepare_ptrs() is unused and will need to be removed,
but we can still move some code there if necessary.
Yeah I think that it can be simply removed then.
Removed.
=== 5
@Bertrand: do you have anything against pg_shm_allocations_numa
instead of pg_shm_numa_allocations? I don't mind changing it...
I think that pg_shm_allocations_numa() is better (given the examples you just
shared).
OK, let's go with this name then (in v22).
=== 6
I find all of those non-user friendly and I'm afraid I won't be able
to pull that alone in time...
Maybe we could add a few words in the doc to mention the "multiple nodes per
buffer" case? And try to improve it for say 19?
Right, we could also put it as a limitation. I would be happy to leave
it as it must be a rare condition, but Tomas stated he's not.
Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18).
Do you mean discard pg_buffercache_numa (0002+0003) and instead go
with pg_shm_allocations_numa (0004) ?
BTW: I noticed that Tomas responded to this with his v22 after I had
addressed all of the above in my own v22, so I'll post v23 soon and then
let's continue there.
-J.
Hi Jakub,
On Thu, Apr 03, 2025 at 02:36:57PM +0200, Jakub Wartak wrote:
On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Right, we could also put it as a limitation. I would be happy to leave
it as it must be a rare condition, but Tomas stated he's not.
Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18).
Do you mean discard pg_buffercache_numa (0002+0003) and instead go
with pg_shm_allocations_numa (0004) ?
No, I meant the opposite: focus on 0001, 0002 and 0003 for 18. But if Tomas is
confident enough to also take on 0004 in addition, that's fine too.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Apr 3, 2025 at 2:15 PM Tomas Vondra <tomas@vondra.me> wrote:
Ah, OK. Jakub, can you correct (and double-check) this in the next
version of the patch?
Done.
About v21-0002:
=== 1
I can see that the pg_buffercache_init_entries() helper comments are added in
v21-0003 but I think that it would be better to add them in v21-0002
(where the helper is actually created).
+1 to that
Done.
About v21-0003:
=== 2
I hear you, attached v21 / 0003 is free of float/double arithmetic
and uses only non-floating-point values.
+ if (buffers_per_page > 1)
+     os_page_query_count = NBuffers;
+ else
+     os_page_query_count = NBuffers * pages_per_buffer;
yeah, that's more elegant. I think that it properly handles the relationships
between buffer and page sizes without relying on float arithmetic.
In the review I just submitted, I changed this to
os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);
but maybe it's less clear. Feel free to undo my change.
Cool, thanks, will send shortly v23 with this applied.
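For illustration only, the integer-only computation discussed above can be sketched as standalone C. This is not the patch's code: the function name and the plain size_t parameters are hypothetical stand-ins for NBuffers, BLCKSZ and the OS page size, and the sketch assumes both sizes are powers of two.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical standalone version of the os_page_query_count logic.
 * Integer division makes pages_per_buffer 0 whenever one OS page is
 * larger than a block, so no floating-point ratio is needed.
 */
size_t
os_page_query_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
    size_t      pages_per_buffer;

    assert(blcksz > 0 && os_page_size > 0);

    pages_per_buffer = blcksz / os_page_size;

    /* Same result as Max(NBuffers, NBuffers * pages_per_buffer) */
    if (pages_per_buffer > 1)
        return nbuffers * pages_per_buffer;

    return nbuffers;
}
```

With 16384 buffers of 8kB and 4kB OS pages this yields 32768 pages to query; with 2MB huge pages it stays at 16384, one query per buffer.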
=== 3
+ if (buffers_per_page > 1)
+ {
As buffers_per_page does not change, I think I'd put this check outside of the
for (i = 0; i < NBuffers; i++) loop, something like:
if (buffers_per_page > 1) {
    /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
    for (i = 0; i < NBuffers; i++) {
    .
    .
    .
    .
} else {
    /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
    for (i = 0; i < NBuffers; i++) {
    .
    .
I don't know. It's a matter of opinion, but I find the current code
fairly understandable. Maybe if it meaningfully reduced the code
nesting, but even with the extra branch we'd still need the for loop.
I'm not against doing this differently, but I'd have to see how that
looks. Until then I think it's fine to have the code as is.
v23 will incorporate Bertrand's idea and is coming soon. No hard feelings, but
it's kind of painful to switch like that ;)
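As a rough sketch of the "check outside the loop" shape under discussion (not the code from any version of the patch), the two cases could look like this in standalone C. The function name and the first_page output array are hypothetical, and the sketch assumes the buffer pool starts on an OS page boundary and that both sizes are powers of two.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: for each buffer, record the index of the first
 * OS page it occupies. The block-size vs. page-size test is loop
 * invariant, so it is evaluated once, outside both loops.
 */
void
map_buffers_to_first_page(size_t nbuffers, size_t blcksz,
                          size_t os_page_size, size_t *first_page)
{
    size_t      i;

    assert(blcksz > 0 && os_page_size > 0);

    if (os_page_size > blcksz)
    {
        /* BLCKSZ < page size: one OS page hosts several buffers */
        size_t      buffers_per_page = os_page_size / blcksz;

        for (i = 0; i < nbuffers; i++)
            first_page[i] = i / buffers_per_page;
    }
    else
    {
        /* BLCKSZ >= page size: a buffer spans one or more OS pages */
        size_t      pages_per_buffer = blcksz / os_page_size;

        for (i = 0; i < nbuffers; i++)
            first_page[i] = i * pages_per_buffer;
    }
}
```

The trade-off raised in the thread still applies: hoisting the branch duplicates the for loop, so it only pays off if each loop body becomes noticeably simpler.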
=== 4
That _numa_prepare_ptrs() is unused and will need to be removed,
but we can still move some code there if necessary.
Yeah I think that it can be simply removed then.
I'm not particularly attached to having the _ptrs() function, but why
couldn't it build the os_page_ptrs array as before?
I've removed it in v23, the code just didn't flow for me...
=== 5
@Bertrand: do you have anything against pg_shm_allocations_numa
instead of pg_shm_numa_allocations? I don't mind changing it...
I think that pg_shm_allocations_numa() is better (given the examples you just
shared).
Done.
=== 6
I find all of those non-user friendly and I'm afraid I won't be able
to pull that alone in time...
Maybe we could add a few words in the doc to mention the "multiple nodes per
buffer" case? And try to improve it for say 19? Also maybe we should just focus
till v21-0003 (and discard v21-0004 for 18).
IMHO it's not enough to paper this over by mentioning it in the docs.
OK
I'm a bit confused about which patch you suggest to leave out. I think
0004 is pg_shmem_allocations_numa, but I think that part is fine, no?
We've been discussing how to represent nodes for buffers in 0003.
IMHO 0001 + 0004 is good. 0003 is probably the last troublemaker, but
we settled on arrays right?
I understand it'd require a bit of coding, but AFAICS adding an array
would be a fairly trivial amount of code. Something like
values[i++] = PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID));
would do, I think. But if we decide to do the 3-column view, that'd be
even simpler, I think. AFAICS it means we could mostly ditch the
pg_buffercache refactoring in patch 0002, and 0003 would get much
simpler too I think.
If Jakub doesn't have time to work on this, I can take a stab at it
later today.
I won't be able to even start on this today, so if you have cycles for
it, please do so...
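To make the "array of nodes per buffer" idea a bit more concrete, here is a minimal standalone sketch of the step that would come before building such an array: collapsing the per-page status values of one buffer into a list of distinct node IDs. The function name and signature are hypothetical and not taken from the patch; the sketch relies only on the documented move_pages(2) behaviour that each status entry is either a node ID or a negative errno value.

```c
#include <assert.h>

/*
 * Hypothetical helper: collect the distinct NUMA node IDs found in the
 * status entries of the OS pages backing a single buffer. Negative
 * entries (error codes such as -ENOENT) are skipped. nodes_out must
 * have room for npages entries. Returns the number of distinct nodes.
 */
int
collect_distinct_nodes(const int *page_status, int npages, int *nodes_out)
{
    int         nodecnt = 0;
    int         p;

    for (p = 0; p < npages; p++)
    {
        int         node = page_status[p];
        int         seen = 0;
        int         k;

        if (node < 0)
            continue;           /* page not mapped or query failed */

        for (k = 0; k < nodecnt; k++)
        {
            if (nodes_out[k] == node)
            {
                seen = 1;
                break;
            }
        }

        if (!seen)
            nodes_out[nodecnt++] = node;
    }

    return nodecnt;
}
```

With the 3-column view the same information would be emitted as one row per (buffer, page, node) instead, which avoids the array handling altogether.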
-J.
On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
On 4/3/25 09:01, Jakub Wartak wrote:
On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas,
Here's v23 attached (I had to rework it because you sent v22 just at
the moment I wanted to send mine). Changes include:
- your review should be incorporated
- name change for shm view
- pg_buffercache_numa refactored a little to keep the if outside the
loops (as Bertrand wanted initially)
So let's continue in this subthread.
I think a view with just 3 columns would be a good solution. It's what
pg_shmem_allocations_numa already does, so it'd be consistent with that
part too.
I'm not too worried about the cost of the extra join - it's going to be
a couple dozen milliseconds at worst, I guess, and that's negligible in
the bigger scheme of things (e.g. compared to how long the move_pages is
expected to take). Also, it's not like having everything in the same
view is free - people would have to do some sort of post-processing, and
that has a cost too.
So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.
OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
it's just that I don't have cycles left today and probably lack skills
(I've never dealt with arrays so far) thus it would be slow to get it
right... but I can pick up anything tomorrow morning.
I'm -1 on JSON, I don't see how that would solve anything better than
e.g. a regular array, and it's going to be harder to work with. So if we
don't want to go with the 3-column view proposed earlier, I'd stick to a
simple array. I don't think there's a huge difference between those two
approaches, it should be easy to convert between those approaches using
unnest() and array_agg().
Attached is v22, with some minor review comments:
1) I suggested we should just use "libnuma support" in configure,
instead of talking about "NUMA awareness support", and AFAICS you
agreed. But I still see the old text in configure ... is that
intentional or a bit of forgotten text?
It was my bad, too many rebases and mishaps with git voodoo..
2) I added a couple asserts to pg_buffercache_numa_pages() and comments,
and simplified a couple lines (but that's a matter of preference).
Great, thanks.
3) I don't think it's correct for pg_get_shmem_numa_allocations to just
silently ignore nodes outside the valid range. I suggest we simply do
elog(ERROR), as it's an internal error we don't expect to happen.
It's there too.
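A minimal sketch of the behaviour requested in point 3, assuming a per-page status array as returned by a move_pages(2)-style query. Everything here is hypothetical (names, signature, and the -1 return value); inside the server the failure path would be an elog(ERROR) as suggested above rather than a return code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: sum up bytes per NUMA node for one shared memory
 * allocation. node_bytes must have max_node + 1 entries. Instead of
 * silently ignoring a status value outside [0, max_node], the function
 * stops and returns -1 so the caller can raise an error.
 */
int
accumulate_node_sizes(const int *page_status, size_t npages,
                      size_t os_page_size, int max_node,
                      size_t *node_bytes)
{
    size_t      i;

    assert(max_node >= 0);

    memset(node_bytes, 0, sizeof(size_t) * ((size_t) max_node + 1));

    for (i = 0; i < npages; i++)
    {
        int         node = page_status[i];

        if (node < 0 || node > max_node)
            return -1;          /* unexpected: report, do not skip */

        node_bytes[node] += os_page_size;
    }

    return 0;
}
```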
-J.
Attachments:
v23-0001-Add-support-for-basic-NUMA-awareness.patch (application/octet-stream)
From eadcff4de1d21b0e522a53e19a37fc44eed56db0 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v23 1/4] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 442 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 11615d1122d..e27badd83c3 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index debdf165044..d365a486d3d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2488e9ba998..4bb60e9e080 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..8ebf0b03ec0 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 454ed81f5ea..46e92daeb62 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3243,6 +3264,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3899,6 +3921,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a28a15993a2..63859661951 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8518,6 +8518,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c2f1241b234..b3166ec8f42 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -686,6 +686,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..314cff94dbc
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 51041e75609..228888b2f66 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use move_pages(2) syscall here - instead of get_mempolicy(2) - as the
+ * first one allows us to batch and query about many memory pages in one single
+ * giant system call that is way faster.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.39.5
v23-0004-Add-new-pg_shmem_allocations_numa-view.patch (application/octet-stream)
From 6dbbbe7602cfe4ba33cb7536a6e7c091a568ff96 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v23 4/4] Add new pg_shmem_allocations_numa view
Introduce new pg_shmem_allocations_numa view that allows viewing the shared memory split layout across
NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 132 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 274 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..a83365ae24a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int4</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index e453f856794..4313e6db62c 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -90,6 +91,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -570,3 +573,132 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ /* Get number of OS-aligned pages */
+ shm_ent_page_count = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffers-to-OS-page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count the pages on each NUMA node for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ else
+ elog(ERROR, "invalid NUMA node id outside of allowed range [0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * Unlike pg_get_shmem_allocations(), we ignore the following memory
+ * regions: 1. shared memory allocated but not counted via the shmem
+ * index, 2. as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 63859661951..0a1fc2a25cd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8522,6 +8522,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..668172f7d79
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 5588d83e1bf..f9c3ff259bc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3150,6 +3150,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3169,6 +3175,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..abfdc97abc5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..034098783fb
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 286b1d03756..81b40a1a330 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1922,12 +1922,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
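
For discussion, the per-entry accounting done in pg_get_shmem_allocations_numa() above can be sketched outside the backend roughly as follows. This is only an illustration, not code from the patch: align_up(), pages_for_allocation() and tally_node_bytes() are hypothetical names, the status array stands in for whatever pg_numa_query_pages() fills in, and the sketch assumes the allocation starts on an OS page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Round len up to the next multiple of align (a power of two), similar in
 * spirit to what TYPEALIGN() does in the patch. Hypothetical helper.
 */
static uint64_t
align_up(uint64_t align, uint64_t len)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (len + align - 1) & ~(align - 1);
}

/* Number of OS pages needed to cover an allocation of 'size' bytes. */
static uint64_t
pages_for_allocation(uint64_t size, uint64_t os_page_size)
{
    return align_up(os_page_size, size) / os_page_size;
}

/*
 * Sum up bytes per NUMA node from a per-page status array (one node id per
 * OS page). Returns 0 on success, or -1 if any status falls outside
 * [0, max_node], e.g. a negative errno-style value for an untouched page.
 */
static int
tally_node_bytes(const int *status, uint64_t npages, uint64_t os_page_size,
                 int max_node, uint64_t *node_bytes)
{
    memset(node_bytes, 0, sizeof(uint64_t) * ((size_t) max_node + 1));

    for (uint64_t i = 0; i < npages; i++)
    {
        int s = status[i];

        if (s < 0 || s > max_node)
            return -1;
        node_bytes[s] += os_page_size;
    }
    return 0;
}
```

A caller would size node_bytes as max_node + 1 elements, then emit one (name, node_id, size) row per array element, which is what the loop over max_nodes in the patch does.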
v23-0002-pg_buffercache-split-pg_buffercache_pages-into-p.patch
From db68dc5a48226eeaf74e109160972888e99899db Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v23 2/4] pg_buffercache: split pg_buffercache_pages into parts
Split pg_buffercache_pages() into multiple smaller functions, to allow
reuse in future patches. This introduces three new functions:
- pg_buffercache_init_entries
- pg_buffercache_save_tuple
- get_buffercache_tuple
that help add entries into a tuplestore, describing the contents of
the buffercache.
This is a preparation for future patches extending pg_buffercache, e.g.
to add NUMA observability.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 297 ++++++++++--------
1 file changed, 159 insertions(+), 138 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..b89bd228fb7 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -68,80 +68,175 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Allocates and returns new user function context based on SRF context
+ * (requires funcctx to be initialized by SRF_FIRSTCALL_INIT()) and
+ * standard function call info.
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
- FuncCallContext *funcctx;
- Datum result;
- MemoryContext oldcontext;
BufferCachePagesContext *fctx; /* User function context. */
+ MemoryContext oldcontext;
TupleDesc tupledesc;
TupleDesc expected_tupledesc;
- HeapTuple tuple;
- if (SRF_IS_FIRSTCALL())
- {
- int i;
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
- funcctx = SRF_FIRSTCALL_INIT();
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... - we
+ * can't use the result type determined by the function definition without
+ * potentially crashing when somebody uses the old (or even wrong)
+ * function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+ expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+ OIDOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+ INT2OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+ BOOLOID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+ INT2OID, -1, 0);
+
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ fctx->record = (BufferCachePagesRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCachePagesRec) * NBuffers);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+ return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Save buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+ bufHdr = GetBufferDescriptor(record_id);
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+
+ bufRecord->bufferid = BufferDescriptorGetBuffer(bufHdr);
+ bufRecord->relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ bufRecord->reltablespace = bufHdr->tag.spcOid;
+ bufRecord->reldatabase = bufHdr->tag.dbOid;
+ bufRecord->forknum = BufTagGetForkNum(&bufHdr->tag);
+ bufRecord->blocknum = bufHdr->tag.blockNum;
+ bufRecord->usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ bufRecord->pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+ if (buf_state & BM_DIRTY)
+ bufRecord->isdirty = true;
+ else
+ bufRecord->isdirty = false;
+
+ /* Note if the buffer is valid, and has storage created */
+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ bufRecord->isvalid = true;
+ else
+ bufRecord->isvalid = false;
+
+ UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+ Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+ HeapTuple tuple;
+ BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
- /* Switch context when allocating stuff to be used in later calls */
- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ values[0] = Int32GetDatum(bufRecord->bufferid);
+ memset(nulls, false, sizeof(nulls));
- /* Create a user function context for cross-call persistence */
- fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+ /*
+ * Set all fields except the bufferid to null if the buffer is unused or
+ * not valid.
+ */
+ if (bufRecord->blocknum == InvalidBlockNumber || bufRecord->isvalid == false)
+ memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
+ else
+ {
+ values[1] = ObjectIdGetDatum(bufRecord->relfilenumber);
+ values[2] = ObjectIdGetDatum(bufRecord->reltablespace);
+ values[3] = ObjectIdGetDatum(bufRecord->reldatabase);
+ values[4] = ObjectIdGetDatum(bufRecord->forknum);
+ values[5] = Int64GetDatum((int64) bufRecord->blocknum);
+ values[6] = BoolGetDatum(bufRecord->isdirty);
+ values[7] = Int16GetDatum(bufRecord->usagecount);
/*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
+ * unused for v1.0 callers, but the array is always long enough
*/
- if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ }
- if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
- expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
- elog(ERROR, "incorrect number of output arguments");
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
- /* Construct a tuple descriptor for the result rows. */
- tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
- TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
- INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
- OIDOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
- INT2OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
- INT8OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
- BOOLOID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
- INT2OID, -1, 0);
-
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
- TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
- INT4OID, -1, 0);
-
- fctx->tupdesc = BlessTupleDesc(tupledesc);
-
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
- fctx->record = (BufferCachePagesRec *)
- MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCachePagesRec) * NBuffers);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
-
- /* Return to original context when allocating transient memory */
- MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
* Scan through all the buffers, saving the relevant fields in the
@@ -152,36 +247,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
* locks, so the information of each buffer is self-consistent.
*/
for (i = 0; i < NBuffers; i++)
- {
- BufferDesc *bufHdr;
- uint32 buf_state;
-
- bufHdr = GetBufferDescriptor(i);
- /* Lock each buffer header before inspecting. */
- buf_state = LockBufHdr(bufHdr);
-
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
- if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
- else
- fctx->record[i].isdirty = false;
-
- /* Note if the buffer is valid, and has storage created */
- if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
- else
- fctx->record[i].isvalid = false;
-
- UnlockBufHdr(bufHdr, buf_state);
- }
+ pg_buffercache_save_tuple(i, fctx);
}
funcctx = SRF_PERCALL_SETUP();
@@ -191,55 +257,10 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
if (funcctx->call_cntr < funcctx->max_calls)
{
+ Datum result;
uint32 i = funcctx->call_cntr;
- Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
- bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
-
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
- nulls[0] = false;
-
- /*
- * Set all fields except the bufferid to null if the buffer is unused
- * or not valid.
- */
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
- {
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
- nulls[6] = true;
- nulls[7] = true;
- /* unused for v1.0 callers, but the array is always long enough */
- nulls[8] = true;
- }
- else
- {
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
- nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
- nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
- nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
- nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
- nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
- nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
- nulls[7] = false;
- /* unused for v1.0 callers, but the array is always long enough */
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
- nulls[8] = false;
- }
-
- /* Build and return the tuple. */
- tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
- result = HeapTupleGetDatum(tuple);
+ result = get_buffercache_tuple(i, fctx);
SRF_RETURN_NEXT(funcctx, result);
}
else
--
2.39.5
v23-0003-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
From 0c6dce72f23b5064b1f606d684f9107b4c3e95b8 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v23 3/4] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
XXX: Right now we just report the NUMA node of the first page when a
single buffer spans multiple OS pages.
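
To make the block/page size mismatch above concrete, here is a small standalone sketch (not code from this patch). The helper names first_os_page_of_buffer() and os_pages_spanned() are made up for illustration, and the sketch assumes the buffer pool starts on an OS page boundary, which may not hold in practice.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Index of the OS page holding the first byte of buffer 'buf_idx', counted
 * from the start of the buffer pool. Hypothetical helper.
 */
static uint64_t
first_os_page_of_buffer(uint64_t buf_idx, uint64_t blcksz,
                        uint64_t os_page_size)
{
    assert(blcksz > 0 && os_page_size > 0);
    return (buf_idx * blcksz) / os_page_size;
}

/* Number of OS pages that a single buffer overlaps. */
static uint64_t
os_pages_spanned(uint64_t buf_idx, uint64_t blcksz, uint64_t os_page_size)
{
    uint64_t first;
    uint64_t last;

    assert(blcksz > 0 && os_page_size > 0);
    first = (buf_idx * blcksz) / os_page_size;
    last = (buf_idx * blcksz + blcksz - 1) / os_page_size;
    return last - first + 1;
}
```

With 8KB blocks on 4KB pages each buffer overlaps two OS pages, so reporting only the first page's node loses information; with 1KB blocks, or with huge pages, several buffers share one OS page and always report the same node.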
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 3 +-
.../expected/pg_buffercache_numa.out | 28 +++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 24 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 227 +++++++++++++++++-
.../pg_buffercache/pg_buffercache_pages.c.rej | 23 ++
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 61 ++++-
10 files changed, 386 insertions(+), 7 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/pg_buffercache_pages.c.rej
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..8c1e891eab2
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,24 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+ relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+ pinning_backends int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index b89bd228fb7..a2c1d192b83 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,13 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
-#define NUM_BUFFERCACHE_PAGES_ELEM 9
+#define NUM_BUFFERCACHE_PAGES_ELEM 10
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
@@ -46,6 +47,7 @@ typedef struct
* because of bufmgr.c's PrivateRefCount infrastructure.
*/
int32 pinning_backends;
+ int32 numa_node_id;
} BufferCachePagesRec;
@@ -64,10 +66,14 @@ typedef struct
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
/*
* Helper routine for pg_buffercache_pages().
*
@@ -123,9 +129,12 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
INT2OID, -1, 0);
- if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
INT4OID, -1, 0);
+ if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "node_id",
+ INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -144,7 +153,7 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Save buffer cache information for a single buffer.
*/
@@ -179,11 +188,13 @@ pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
else
bufRecord->isvalid = false;
+ bufRecord->numa_node_id = -1;
+
UnlockBufHdr(bufHdr, buf_state);
}
/*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
* Format and return a tuple for a single buffer cache entry.
*/
@@ -218,6 +229,7 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
* unused for v1.0 callers, but the array is always long enough
*/
values[8] = Int32GetDatum(bufRecord->pinning_backends);
+ values[9] = Int32GetDatum(bufRecord->numa_node_id);
}
/* Build and return the tuple. */
@@ -267,6 +279,213 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ j;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_pages_status = NULL;
+ uint64 os_page_query_count = 0;
+ int pages_per_buffer = 0;
+ int buffers_per_page = 0;
+ volatile uint64 touch pg_attribute_unused();
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ buffers_per_page = os_page_size / BLCKSZ;
+ pages_per_buffer = BLCKSZ / os_page_size;
+
+ /*
+ * The page size and the block size are both expected to be powers of
+ * two, so one divides the other (we don't know in which direction).
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * Either both counts are 1 (when the pages have the same size), or
+ * exactly one of them is zero. Both can't be zero at the same time.
+ */
+ Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
+ Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
+ ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+
+ /*
+ * How many addresses we are going to query (store) depends on the
+ * relation between BLCKSZ : PAGESIZE. We need at least one status per
+ * buffer - if the memory page is larger than buffer, we still query
+ * it for each buffer. With multiple memory pages per buffer, we need
+ * that many entries.
+ */
+ os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_query_count, os_page_size, buffers_per_page, pages_per_buffer);
+
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_query_count);
+ os_pages_status = palloc(sizeof(int) * os_page_query_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here.
+ */
+ memset(os_pages_status, 0xff, sizeof(int) * os_page_query_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+
+ if (buffers_per_page > 1)
+ {
+ /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
+ for (i = 0; i < NBuffers; i++)
+ {
+ pg_buffercache_save_tuple(i, fctx);
+
+ /*
+ * Although we could query just once per OS page, we do it
+ * repeatedly for each Buffer and hit the same address, as
+ * move_pages(2) requires page alignment. This also simplifies
+ * the retrieval code later on. Also, Buffer ids start from 1.
+ */
+ os_page_ptrs[i] = (char *) TYPEALIGN_DOWN(os_page_size,
+ (char *) BufferGetBlock(i + 1));
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ }
+ else
+ {
+ /*
+ * BLCKSZ >= PAGESIZE: If a Buffer occupies more than one OS page,
+ * we query all of its OS pages for NUMA information. This won't run
+ * for BLCKSZ < PAGESIZE.
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ pg_buffercache_save_tuple(i, fctx);
+
+ for (j = 0; j < pages_per_buffer; j++)
+ {
+ size_t idx = (size_t) (i * pages_per_buffer) + j;
+
+ /* Buffer ids start from 1 */
+ os_page_ptrs[idx] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+
+ /*
+ * Only need to touch memory once per backend process
+ * lifetime
+ */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ }
+
+ if (pg_numa_query_pages(0, os_page_query_count, os_page_ptrs, os_pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /*
+ * Once we have our NUMA information we resolve memory pointers back
+ * to Buffers
+ */
+ for (i = 0; i < NBuffers; i++)
+ {
+ size_t idx;
+
+ /*
+ * Note: We could check for errors in os_pages_status and report
+ * them. Again, a single DB block might span multiple NUMA nodes
+ * if it crosses OS pages on node boundaries, but we only record
+ * the node of the first page. This is a simplification but should
+ * be sufficient for most analyses.
+ */
+
+ if (buffers_per_page > 1)
+ idx = i;
+ else
+ {
+ /*
+ * XXX: BLCKSZ >= PAGESIZE: return the node id for this Buffer
+ * based only on >> FIRST << OS page. We could do something
+ * else with this.
+ */
+ idx = i * pages_per_buffer;
+ }
+ fctx->record[i].numa_node_id = os_pages_status[idx];
+ }
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ Datum result;
+ uint32 i = funcctx->call_cntr;
+
+ result = get_buffercache_tuple(i, fctx);
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ {
+ firstNumaTouch = false;
+ SRF_RETURN_DONE(funcctx);
+ }
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c.rej b/contrib/pg_buffercache/pg_buffercache_pages.c.rej
new file mode 100644
index 00000000000..2027b60d163
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c.rej
@@ -0,0 +1,23 @@
+diff a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c (rejected hunks)
+@@ -64,12 +66,20 @@ typedef struct
+ * relation node/tablespace/database/blocknum and dirty indicator.
+ */
+ PG_FUNCTION_INFO_V1(pg_buffercache_pages);
++PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
+ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
+ PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
+ PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
++/* Only need to touch memory once per backend process lifetime */
++static bool firstNumaTouch = true;
++
+ /*
+- * Helper routine for pg_buffercache_pages().
++ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
++ *
++ * Allocates and returns new user function context based on SRF context
++ * (requires that funcctx be initialized by SRF_FIRSTCALL_INIT()) and
++ * standard function call info.
+ */
+ static BufferCachePagesContext *
+ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..315227bf0ce 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides the same information
+ as <function>pg_buffercache_pages()</function> but is slower because it also
+ provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Extra column</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.39.5
On 4/3/25 15:12, Jakub Wartak wrote:
On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
...
So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.
OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
it's just that I don't have cycles left today and probably lack skills
(I've never dealt with arrays so far) thus it would be slow to get it
right... but I can pick up anything tomorrow morning.
OK, I took a stab at reworking/simplifying this the way I proposed.
Here's v24 - needs more polishing, but hopefully enough to show what I
had in mind.
It does these changes:
1) Drops 0002 with the pg_buffercache refactoring, because the new view
is not "extending" the existing one.
2) Reworks pg_buffercache_numa to return just three columns, bufferid,
page_num and node_id. page_num is a sequence starting from 0 for each
buffer.
3) It now builds an array of records, with one record per buffer/page.
4) I realized we don't really need to worry about buffers_per_page very
much, except for logging/debugging. There's always "at least one page"
per buffer, even if an incomplete one, so we can do this:
os_page_count = NBuffers * Max(1, pages_per_buffer);
and then
for (i = 0; i < NBuffers; i++)
{
for (j = 0; j < Max(1, pages_per_buffer); j++)
{
..
}
}
and everything just works fine, I think.
Opinions? I personally find this much cleaner / easier to understand.
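To make the simplified loop above concrete, here is a minimal standalone
sketch. It is not code from the patch: the NumaPageRec struct, the
SKETCH_MAX macro and both helper names are hypothetical, and "base" simply
stands in for the start address of the buffer pool. It only shows how
"at least one page per buffer" yields one record per buffer/page pair,
with each address rounded down to the start of its OS page.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the Max() macro from PostgreSQL's c.h. */
#define SKETCH_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Hypothetical record: one entry per (buffer, OS page) pair. */
typedef struct NumaPageRec
{
	int			bufferid;	/* 1-based, like pg_buffercache */
	int			page_num;	/* 0-based within the buffer */
	uintptr_t	page_addr;	/* page-aligned address to query */
} NumaPageRec;

/* Number of records needed: at least one OS page per buffer. */
static size_t
numa_page_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
	size_t		pages_per_buffer = blcksz / os_page_size;

	return nbuffers * SKETCH_MAX((size_t) 1, pages_per_buffer);
}

/*
 * Fill recs[] (sized by numa_page_count()) and return the number of
 * entries written.
 */
static size_t
numa_fill_records(NumaPageRec *recs, uintptr_t base, size_t nbuffers,
				  size_t blcksz, size_t os_page_size)
{
	size_t		pages_per_buffer = blcksz / os_page_size;
	size_t		per_buf = SKETCH_MAX((size_t) 1, pages_per_buffer);
	size_t		n = 0;

	for (size_t i = 0; i < nbuffers; i++)
	{
		for (size_t j = 0; j < per_buf; j++)
		{
			uintptr_t	addr = base + i * blcksz + j * os_page_size;

			recs[n].bufferid = (int) (i + 1);
			recs[n].page_num = (int) j;
			/* round down to the start of the containing OS page */
			recs[n].page_addr = addr - (addr % os_page_size);
			n++;
		}
	}
	return n;
}
```

With 8kB blocks and 4kB pages each buffer produces two records (page_num 0
and 1); with 2MB huge pages each buffer produces a single record, and many
buffers share the same page address.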
regards
--
Tomas Vondra
Attachments:
v24-0001-Add-support-for-basic-NUMA-awareness.patch (text/x-patch; charset=UTF-8)
From 70018899b698b186ffebb03c7336022ae83fbfb8 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v24 1/3] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 442 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 11615d1122d..e27badd83c3 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index debdf165044..d365a486d3d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2488e9ba998..4bb60e9e080 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..8ebf0b03ec0 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 454ed81f5ea..46e92daeb62 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3243,6 +3264,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3899,6 +3921,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a28a15993a2..63859661951 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8518,6 +8518,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c2f1241b234..b3166ec8f42 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -686,6 +686,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..314cff94dbc
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *)ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 51041e75609..228888b2f66 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but
+ * support for other platforms, e.g. Win32 or FreeBSD, might be possible in
+ * the future too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use the move_pages(2) syscall here, rather than get_mempolicy(2),
+ * because it lets us batch the inquiry for many memory pages into a single
+ * system call, which is much faster than one call per page.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.49.0
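A note on the pg_numa_touch_mem_if_required() macro in the patch above: the idea is to read one word from each OS page so that the kernel has backed the page before move_pages(2) is asked about it. What follows is only a minimal standalone sketch of that idea, not code from the patch. touch_pages() and its parameters are hypothetical names, and the sketch assumes ordinary heap memory with a base address aligned for 8-byte reads rather than PostgreSQL shared memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the patch): read one 8-byte word at the
 * start of every OS page in [base, base + len), so that each page has been
 * accessed at least once. The volatile qualifiers keep the compiler from
 * discarding the loads. Returns the number of pages touched.
 *
 * Assumption: base is suitably aligned for uint64_t reads and os_page_size
 * is a multiple of 8.
 */
static size_t
touch_pages(const void *base, size_t len, size_t os_page_size)
{
	volatile uint64_t sink = 0;
	size_t		touched = 0;

	assert(os_page_size >= sizeof(uint64_t));

	for (size_t off = 0; off + sizeof(uint64_t) <= len; off += os_page_size)
	{
		sink = *(const volatile uint64_t *) ((const char *) base + off);
		touched++;
	}

	(void) sink;
	return touched;
}
```

In the patch the touch is done only on the first call in a backend (guarded by firstNumaTouch), since later calls in the same process find the pages already mapped.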
Attachment: v24-0002-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
From 70e9617f969c8b4cbb27a6d10afaf0cb6ad90206 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 20:21:25 +0200
Subject: [PATCH v24 2/3] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 5 +-
.../expected/pg_buffercache_numa.out | 28 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 22 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 276 ++++++++++++++++++
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 74 ++++-
9 files changed, 428 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
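The commit message above notes that the database block size and the OS memory page size may differ in either direction. As a rough illustration of the arithmetic the patch performs before calling move_pages(2), here is a standalone sketch. PageMapping and compute_page_mapping() are hypothetical names introduced only for this example; the field names mirror the variables used in pg_buffercache_numa_pages().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the values derived from the two sizes. */
typedef struct PageMapping
{
	size_t		buffers_per_page;	/* > 1 when OS pages are larger (huge pages) */
	size_t		pages_per_buffer;	/* > 1 when blocks are larger than OS pages */
	uint64_t	query_count;		/* addresses to pass to the NUMA inquiry */
} PageMapping;

/*
 * Both sizes are assumed to be powers of two, so one divides the other.
 * Integer division makes exactly one of the two ratios zero unless the
 * sizes are equal, in which case both are 1.
 */
static PageMapping
compute_page_mapping(size_t block_size, size_t os_page_size, uint64_t nbuffers)
{
	PageMapping m;

	assert(block_size > 0 && os_page_size > 0);
	assert(os_page_size % block_size == 0 || block_size % os_page_size == 0);

	m.buffers_per_page = os_page_size / block_size;
	m.pages_per_buffer = block_size / os_page_size;

	/* At least one address per buffer, more if a buffer spans several pages. */
	m.query_count = nbuffers * (m.pages_per_buffer > 1 ? m.pages_per_buffer : 1);

	return m;
}
```

With 8kB blocks, 4kB OS pages and 16384 buffers this gives 32768 addresses, which matches the os_page_count=32768 shown in the NOTICE output quoted at the top of the thread. With 2MB huge pages the count stays at one address per buffer.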
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..1230e244a5f
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, page_num int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..d653f4af394 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -20,6 +21,8 @@
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
+#define NUM_BUFFERCACHE_NUMA_ELEM 3
+
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
.version = PG_VERSION
@@ -59,15 +62,41 @@ typedef struct
} BufferCachePagesContext;
+typedef struct
+{
+ uint32 bufferid;
+ int32 numa_page;
+ int32 numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+ TupleDesc tupdesc;
+ int buffers_per_page;
+ int pages_per_buffer;
+ int os_page_size;
+ BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
+
/*
* Function returning data from the shared buffer cache - buffer number,
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -246,6 +275,253 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * Inquire about NUMA memory mappings, especially the NUMA node.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ BufferCacheNumaContext *fctx; /* User function context. */
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ j,
+ idx;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_page_status;
+ uint64 os_page_count;
+ int pages_per_buffer;
+ int buffers_per_page;
+ volatile uint64 touch pg_attribute_unused();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: (1) determine the OS
+ * memory page size, (2) calculate how many OS pages are used by all
+ * buffer blocks, and (3) calculate how many OS pages are contained
+ * within each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ buffers_per_page = os_page_size / BLCKSZ;
+ pages_per_buffer = BLCKSZ / os_page_size;
+
+ /*
+ * The page and block sizes are expected to be powers of two, so one
+ * divides the other (we don't know in which direction).
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * Either both counts are 1 (when the pages have the same size), or
+ * exactly one of them is zero. Both can't be zero at the same time.
+ */
+ Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
+ Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
+ ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+
+ /*
+ * How many addresses we are going to query (store) depends on the
+ * relation between BLCKSZ : PAGESIZE. We need at least one status per
+ * buffer - if the memory page is larger than buffer, we still query
+ * it for each buffer. With multiple memory pages per buffer, we need
+ * that many entries.
+ */
+ os_page_count = NBuffers * Max(1, pages_per_buffer);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_count, os_page_size, buffers_per_page, pages_per_buffer);
+
+
+ /* initialize the multi-call context */
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+ /*
+ * Look up the result type the caller expects and verify that it has
+ * the expected number of columns. We can't rely on the result type
+ * determined by the function definition without potentially crashing
+ * when somebody uses an old (or even wrong) function definition, so
+ * get_call_result_type() is used to check what was actually declared
+ * for this call.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "page_num",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "node_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate os_page_count worth of BufferCacheNumaRec records. */
+ fctx->record = (BufferCacheNumaRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCacheNumaRec) * os_page_count);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+
+ /* determine the NUMA node for OS pages */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have a
+ * bug in our buffer to OS page mapping code here.
+ */
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ /*
+ * If we have multiple OS pages per buffer, fill those in too. We
+ * always want at least one OS page, even if there are multiple
+ * buffers per page.
+ *
+ * Although we could query just once per OS page, we do it
+ * repeatedly for each buffer and hit the same address, as
+ * move_pages(2) requires page alignment. This also simplifies the
+ * retrieval code later on. Also, buffer IDs start from 1.
+ */
+ for (j = 0; j < Max(1, pages_per_buffer); j++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].numa_page = j;
+
+ os_page_ptrs[idx]
+ = (char *) TYPEALIGN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+
+ ++idx;
+ }
+
+ }
+
+ /* we should get exactly the expected number of entries */
+ Assert(idx == os_page_count);
+
+ /* query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /*
+ * Update the entries with NUMA node ID. The status array is indexed the
+ * same way as the entry index.
+ */
+ for (i = 0; i < os_page_count; i++)
+ {
+ fctx->record[i].numa_node = os_page_status[i];
+ }
+
+ pfree(os_page_status);
+ pfree(os_page_ptrs);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+ Datum values[NUM_BUFFERCACHE_NUMA_ELEM];
+ bool nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(fctx->record[i].numa_page);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(fctx->record[i].numa_node);
+ nulls[2] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..59dbbd2b25e 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,14 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries, separately from
+ <function>pg_buffercache_pages()</function> because they are slow to retrieve.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +210,68 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>page_num</structfield> <type>int</type>
+ </para>
+ <para>
+ number of the OS memory page within this buffer
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int</type>
+ </para>
+ <para>
+ ID of NUMA node
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.49.0
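The commit message of the patch above mentions status -2 (ENOENT) for pages that are not present. In query mode move_pages(2) stores, for each address, either the NUMA node ID or a negative errno value, so a caller has to tell the two apart. A minimal sketch of such a check follows; PageStatusKind and classify_page_status() are hypothetical names for this example and do not appear in the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification of one entry of the move_pages(2) status array. */
typedef enum PageStatusKind
{
	PAGE_ON_NODE,				/* status is a valid node ID */
	PAGE_NOT_PRESENT,			/* -ENOENT: page not faulted in or unmapped */
	PAGE_OTHER_ERROR			/* any other negative errno or out-of-range ID */
} PageStatusKind;

/*
 * max_node is the highest valid node ID, for example the value returned by
 * pg_numa_get_max_node() in the patch.
 */
static PageStatusKind
classify_page_status(int status, int max_node)
{
	if (status >= 0 && status <= max_node)
		return PAGE_ON_NODE;
	if (status == -ENOENT)
		return PAGE_NOT_PRESENT;
	return PAGE_OTHER_ERROR;
}
```

Touching the pages first, as the patch does, is meant to avoid the PAGE_NOT_PRESENT case for shared buffers.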
Attachment: v24-0003-Introduce-pg_shmem_allocations_numa-view.patch
From b2c8061d09cf13d6713d0b7f41b88b33b0af7710 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v24 3/3] Introduce pg_shmem_allocations_numa view
Introduce a new pg_shmem_allocations_numa view with information about how
shared memory is distributed across NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 132 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 274 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
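To make the per-node accounting in pg_get_shmem_allocations_numa() easier to follow, here is a small standalone sketch of the same two steps: rounding an allocation up to whole OS pages, then summing page sizes per NUMA node from a status array. pages_for_allocation() and sum_bytes_per_node() are hypothetical names used only here; unlike the patch, the sketch returns -1 on an invalid node ID instead of raising an error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of OS pages needed to cover allocated_size bytes (rounds up). */
static size_t
pages_for_allocation(size_t allocated_size, size_t os_page_size)
{
	assert(os_page_size > 0);
	return (allocated_size + os_page_size - 1) / os_page_size;
}

/*
 * Sum bytes per NUMA node. status[] holds one node ID per page, as filled
 * in by a move_pages(2)-style inquiry; bytes_per_node[] must have room for
 * max_node + 1 entries. Returns 0 on success, -1 if any status value is
 * outside [0, max_node].
 */
static int
sum_bytes_per_node(const int *status, size_t npages, size_t os_page_size,
				   int max_node, size_t *bytes_per_node)
{
	assert(max_node >= 0);
	memset(bytes_per_node, 0, sizeof(size_t) * ((size_t) max_node + 1));

	for (size_t i = 0; i < npages; i++)
	{
		int			s = status[i];

		if (s < 0 || s > max_node)
			return -1;
		bytes_per_node[s] += os_page_size;
	}

	return 0;
}
```

The view then emits one row per (allocation name, node ID) pair with the summed size, including nodes whose size is zero.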
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..a83365ae24a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int4</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index e453f856794..4313e6db62c 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -90,6 +91,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -570,3 +573,132 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: (1) determine the OS memory
+ * page size, (2) calculate how many OS pages are used by all buffer
+ * blocks, and (3) calculate how many OS pages are contained within each
+ * database block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ /* Get number of OS-aligned pages */
+ shm_ent_page_count = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count number of NUMA nodes used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ else
+ elog(ERROR, "invalid NUMA node id outside of allowed range [0, %ld]: %d", max_nodes, s);
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = i;
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 63859661951..0a1fc2a25cd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8522,6 +8522,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..668172f7d79
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 5588d83e1bf..f9c3ff259bc 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3150,6 +3150,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3169,6 +3175,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..abfdc97abc5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..034098783fb
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 286b1d03756..81b40a1a330 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1922,12 +1922,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.49.0
Hi,
On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
On 4/3/25 15:12, Jakub Wartak wrote:
On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
...
So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.
OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
it's just that I don't have cycles left today and probably lack skills
(i've never dealt with arrays so far) thus it would be slow to get it
right... but I can pick up anything tomorrow morning.
OK, I took a stab at reworking/simplifying this the way I proposed.
Here's v24 - needs more polishing, but hopefully enough to show what I
had in mind.
It does these changes:
1) Drops 0002 with the pg_buffercache refactoring, because the new view
is not "extending" the existing one.
I think that makes sense. One would just need to join on the pg_buffercache
view to get more information about the buffer if needed.
The pg_buffercache_numa_pages() doc needs an update though as I don't think that
"+ The <function>pg_buffercache_numa_pages()</function> provides the same
information as <function>pg_buffercache_pages()</function>" is still true.
2) Reworks pg_buffercache_numa to return just three columns, bufferid,
page_num and node_id. page_num is a sequence starting from 0 for each
buffer.
+1 on the idea
3) It now builds an array of records, with one record per buffer/page.
4) I realized we don't really need to worry about buffers_per_page very
much, except for logging/debugging. There's always "at least one page"
per buffer, even if an incomplete one, so we can do this:
os_page_count = NBuffers * Max(1, pages_per_buffer);
and then
for (i = 0; i < NBuffers; i++)
{
for (j = 0; j < Max(1, pages_per_buffer); j++)
That's a nice simplification as we always need to take care of at least one page
per buffer.
and everything just works fine, I think.
I think the same.
Opinions? I personally find this much cleaner / easier to understand.
I agree that's easier to understand and that that looks correct.
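A minimal sketch of the Max(1, pages_per_buffer) idea quoted above, written as standalone C rather than backend code. The names sketch_pages_per_buffer and sketch_os_page_count are hypothetical, and integer division of the block size by the OS page size is an assumption about how pages_per_buffer is derived:

```c
#include <assert.h>
#include <stddef.h>

/* Same idea as PostgreSQL's Max() macro, kept local to this sketch. */
#define SKETCH_MAX(a, b) ((a) > (b) ? (a) : (b))

/*
 * Number of OS pages to query per buffer: blcksz / os_page_size, but never
 * less than one, because a buffer smaller than an OS page still lives on
 * (part of) one page.
 */
static size_t
sketch_pages_per_buffer(size_t blcksz, size_t os_page_size)
{
    assert(blcksz > 0 && os_page_size > 0);
    return SKETCH_MAX((size_t) 1, blcksz / os_page_size);
}

/* Total number of (buffer, page) entries, as in the quoted formula. */
static size_t
sketch_os_page_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
    return nbuffers * sketch_pages_per_buffer(blcksz, os_page_size);
}
```

With 8kB blocks this gives two entries per buffer on 4kB pages and one entry per buffer on 2MB huge pages, so the loops never need a separate code path for the "many buffers per page" case.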
A few random comments:
=== 1
It looks like firstNumaTouch is not set to false anymore.
=== 2
+ pfree(os_page_status);
+ pfree(os_page_ptrs);
Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).
=== 3
+ ro_volatile_var = *(uint64 *)ptr
space missing before "ptr"?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
+my feedback as I've noticed that Bertrand already provided a review.
Right, the code is now simple, and that Max() is brilliant. I've
attached some review findings as .txt
0001 100% LGTM
0002 doc fix + pgindent + Tomas, you should take Authored-by yourself
there for sure, I couldn't pull this off alone in time! So big thanks!
0003 fixes elog UINT64_FORMAT for mingw32 (a little bit funny to have
NUMA on mingw32...:))
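To illustrate the 0003 point: "%ld" is only correct where long is 64 bits wide, which is not the case on Windows targets. The sketch below uses the standard PRIu64 macro as a stand-in for PostgreSQL's UINT64_FORMAT (that substitution, and the helper name sketch_format_node_error, are assumptions made for a standalone example):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's UINT64_FORMAT (assumption for this sketch). */
#define SKETCH_UINT64_FORMAT "%" PRIu64

/*
 * Format the "invalid node id" message with a width-correct specifier for
 * the uint64 upper bound. Returns what snprintf() returns.
 */
static int
sketch_format_node_error(char *buf, size_t buflen, uint64_t max_nodes, int node)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen,
                    "invalid NUMA node id outside of allowed range [0, "
                    SKETCH_UINT64_FORMAT "]: %d",
                    max_nodes, node);
}
```

The format macro is spliced into the string literal at compile time, so the message text stays readable while the specifier follows the platform.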
When started with interleave=all on serious hardware, I'm getting (~5s
for s_b=64GB) from pg_buffercache_numa
node_id | count
---------+---------
3 | 2097152
0 | 2097152
2 | 2097152
1 | 2097152
so this is a valid result (2097152 * 4 NUMA nodes * 8192 buffer
size / 1024 / 1024 / 1024 = 64GB)
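The arithmetic can be double-checked with a few lines of C. This is only a sanity-check sketch; sketch_total_gib is a hypothetical helper and it assumes each counted row corresponds to one 8192-byte buffer:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total size in GiB represented by "rows_per_node" rows on each of
 * "numa_nodes" nodes, when every row stands for "blcksz" bytes.
 */
static uint64_t
sketch_total_gib(uint64_t rows_per_node, uint64_t numa_nodes, uint64_t blcksz)
{
    assert(blcksz > 0);
    return rows_per_node * numa_nodes * blcksz /
        (UINT64_C(1024) * 1024 * 1024);
}
```

Calling sketch_total_gib(2097152, 4, 8192) multiplies 2^21 * 2^2 * 2^13 = 2^36 bytes, which is 64 GiB and matches the shared_buffers setting.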
Also with pgbench -i -s 20 , after ~8s:
select c.relname, n.node_id, count(*) from pg_buffercache_numa n
join pg_buffercache b on (b.bufferid = n.bufferid)
join pg_class c on (c.relfilenode = b.relfilenode)
group by c.relname, n.node_id order by count(*) desc;
relname | node_id | count
-----------------------------------------------+---------+-------
pgbench_accounts | 2 | 8217
pgbench_accounts | 0 | 8190
pgbench_accounts | 3 | 8189
pgbench_accounts | 1 | 8187
pg_statistic | 2 | 32
pg_operator | 2 | 14
pg_depend | 3 | 14
[..]
pg_shmem_allocations_numa also looks good.
I think v24+tiny fixes is good enough to go in.
-J.
Attachments:
v24-review-0002.docfix.txt (text/plain; charset=US-ASCII)
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 59dbbd2b25e..3d9032efafb 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -45,9 +45,10 @@
</para>
<para>
- The <function>pg_buffercache_numa_pages()</function> provides the same information
- as <function>pg_buffercache_pages()</function> but is slower because it also
- provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+ The <function>pg_buffercache_numa_pages()</function> provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+ information is not part of <function>pg_buffercache_pages()</function>
+ itself, as it is much slower to retrieve.
The <structname>pg_buffercache_numa</structname> view wraps the function for
convenient use.
</para>
v24-review-0002.pgident.txt (text/plain; charset=US-ASCII)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index d653f4af394..5526dee7171 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -67,7 +67,7 @@ typedef struct
uint32 bufferid;
int32 numa_page;
int32 numa_node;
-} BufferCacheNumaRec;
+} BufferCacheNumaRec;
/*
* Function context for data persisting over repeated calls.
@@ -79,7 +79,7 @@ typedef struct
int pages_per_buffer;
int os_page_size;
BufferCacheNumaRec *record;
-} BufferCacheNumaContext;
+} BufferCacheNumaContext;
/*
@@ -454,7 +454,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
*/
for (j = 0; j < Max(1, pages_per_buffer); j++)
{
- char *buffptr = (char *) BufferGetBlock(i + 1);
+ char *buffptr = (char *) BufferGetBlock(i + 1);
fctx->record[idx].bufferid = bufferid;
fctx->record[idx].numa_page = j;
@@ -480,8 +480,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
elog(ERROR, "failed NUMA pages inquiry: %m");
/*
- * Update the entries with NUMA node ID. The status array is indexed the
- * same way as the entry index.
+ * Update the entries with NUMA node ID. The status array is indexed
+ * the same way as the entry index.
*/
for (i = 0; i < os_page_count; i++)
{
v24-review-0003.elogfix.txt (text/plain; charset=US-ASCII)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 4313e6db62c..e26af975a7d 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -677,7 +677,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
if (s >= 0 && s <= max_nodes)
nodes[s]++;
else
- elog(ERROR, "invalid NUMA node id outside of allowed range [0, %ld]: %d", max_nodes, s);
+ elog(ERROR, "invalid NUMA node id outside of allowed range [0, " UINT64_FORMAT "]: %d", max_nodes, s);
}
for (i = 0; i <= max_nodes; i++)
On 4/4/25 09:35, Jakub Wartak wrote:
+my feedback as I've noticed that Bertrand already provided a review.
Right, the code is now simple, and that Max() is brilliant. I've
attached some review findings as .txt
0001 100% LGTM
0002 doc fix + pgindent + Tomas, you should take Authored-by yourself
there for sure, I couldn't pull this off alone in time! So big thanks!
0003 fixes elog UINT64_FORMAT for mingw32 (a little bit funny to have
NUMA on mingw32...:))
OK
When started with interleave=all on serious hardware, I'm getting (~5s
for s_b=64GB) from pg_buffercache_numa
node_id | count
---------+---------
3 | 2097152
0 | 2097152
2 | 2097152
1 | 2097152
so this is valid result (2097152*4 numa nodes*8192 buffer
size/1024/1024/1024 = 64GB)
Also with pgbench -i -s 20 , after ~8s:
select c.relname, n.node_id, count(*) from pg_buffercache_numa n
join pg_buffercache b on (b.bufferid = n.bufferid)
join pg_class c on (c.relfilenode = b.relfilenode)
group by c.relname, n.node_id order by count(*) desc;
relname | node_id | count
-----------------------------------------------+---------+-------
pgbench_accounts | 2 | 8217
pgbench_accounts | 0 | 8190
pgbench_accounts | 3 | 8189
pgbench_accounts | 1 | 8187
pg_statistic | 2 | 32
pg_operator | 2 | 14
pg_depend | 3 | 14
[..]
pg_shmem_allocations_numa also looks good.
I think v24+tiny fixes is good enough to go in.
OK.
Do you have any suggestions regarding the column names in the new view?
I'm not sure I like node_id and page_num.
regards
--
Tomas Vondra
On 4/4/25 08:50, Bertrand Drouvot wrote:
Hi,
On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
On 4/3/25 15:12, Jakub Wartak wrote:
On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
...
So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.
OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
it's just that I don't have cycles left today and probably lack skills
(i've never dealt with arrays so far) thus it would be slow to get it
right... but I can pick up anything tomorrow morning.
OK, I took a stab at reworking/simplifying this the way I proposed.
Here's v24 - needs more polishing, but hopefully enough to show what I
had in mind.
It does these changes:
1) Drops 0002 with the pg_buffercache refactoring, because the new view
is not "extending" the existing one.
I think that makes sense. One would just need to join on the pg_buffercache
view to get more information about the buffer if needed.
The pg_buffercache_numa_pages() doc needs an update though as I don't think that
"+ The <function>pg_buffercache_numa_pages()</function> provides the same
information as <function>pg_buffercache_pages()</function>" is still true.
Right, thanks for checking the docs.
2) Reworks pg_buffercache_numa to return just three columns, bufferid,
page_num and node_id. page_num is a sequence starting from 0 for each
buffer.
+1 on the idea
3) It now builds an array of records, with one record per buffer/page.
4) I realized we don't really need to worry about buffers_per_page very
much, except for logging/debugging. There's always "at least one page"
per buffer, even if an incomplete one, so we can do this:
os_page_count = NBuffers * Max(1, pages_per_buffer);
and then
for (i = 0; i < NBuffers; i++)
{
for (j = 0; j < Max(1, pages_per_buffer); j++)
That's a nice simplification as we always need to take care of at least one page
per buffer.
OK. I think I'll consider moving some of this code "building" the
entries into a separate function, to keep the main function easier to
understand.
and everything just works fine, I think.
I think the same.
Opinions? I personally find this much cleaner / easier to understand.
I agree that's easier to understand and that that looks correct.
A few random comments:
=== 1
It looks like that firstNumaTouch is not set to false anymore.
Damn, my mistake.
=== 2
+ pfree(os_page_status);
+ pfree(os_page_ptrs);
Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).
Yeah, maybe. It's not allocated in the multi-call context, but I wasn't
sure. Will check.
=== 3
+ ro_volatile_var = *(uint64 *)ptr
space missing before "ptr"?
Interesting that pgindent didn't tweak this.
regards
--
Tomas Vondra
On Fri, Apr 4, 2025 at 4:36 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas,
Do you have any suggestions regarding the column names in the new view?
I'm not sure I like node_id and page_num.
They actually look good to me. We've discussed earlier dropping
s/numa_//g for column names (after all views contain it already) so
they are fine in this regard.
There's also the question of consistency: (bufferid, page_num,
node_id) -- maybe should just drop "_" and that's it?
Well I would even possibly consider page_num -> ospagenumber, but that's ugly.
-J.
OK,
here's v25 after going through the patches once more, fixing the issues
mentioned by Bertrand, etc. I think 0001 and 0002 are fine, I have a
couple minor questions about 0003.
0002
----
- Adds the new types to typedefs.list, to make pgindent happy.
- Improves comment for pg_buffercache_numa_pages
- Minor formatting tweaks.
- I was wondering if maybe we should have some "global ID" of memory
page, so that with large memory pages it's indicated the buffers are on
the same memory page. Right now each buffer starts page_num from 0, but
it should not be very hard to have a global counter. Opinions?
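One possible shape for such a "global ID", as a hedged sketch only: derive the page index from the address itself, relative to the OS page that holds the start of the buffer pool. The helper name sketch_global_page_id is hypothetical and addresses are passed as integers so the example does not need a real shared memory mapping:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: index of the OS page holding address "addr", counted
 * from the OS page that holds the start of the buffer pool ("pool_start").
 * Two buffers that share a (huge) page get the same ID.
 */
static uint64_t
sketch_global_page_id(uintptr_t pool_start, uintptr_t addr, size_t os_page_size)
{
    assert(os_page_size > 0 && addr >= pool_start);
    return (uint64_t) (addr / os_page_size - pool_start / os_page_size);
}
```

With 2MB pages and 8kB buffers, the first 256 buffers of a page-aligned pool would all report ID 0 and the 257th would report ID 1, which is the "same memory page" signal described above.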
0003
----
- Minor formatting tweaks, comment improvements.
- Isn't this comment a bit confusing / misleading?
/* Get number of OS aligned pages */
AFAICS the point is to adjust the allocated_size to be a multiple of
os_page_size, to get "all" memory pages the segment uses. But that's not
what I understand by "aligned page" (which is about where the page is
expected to start). Or did I get this wrong?
- There's a comment at the end which talks about "ignored segments".
IMHO that type of information should be in the function comment, but I'm
also not quite sure I understand what "output shared memory" is ...
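For the "OS aligned pages" comment discussed above, here is a small standalone sketch of what the expression computes: it rounds the size up to a multiple of the OS page size and divides. SKETCH_TYPEALIGN is intended to mirror PostgreSQL's TYPEALIGN() macro, and sketch_entry_page_count is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round LEN up to a multiple of ALIGNVAL; ALIGNVAL must be a power of 2. */
#define SKETCH_TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

/*
 * Number of OS pages needed to cover "allocated_size" bytes. This looks
 * only at the size, not at where the entry starts.
 */
static size_t
sketch_entry_page_count(size_t allocated_size, size_t os_page_size)
{
    /* the mask trick only works for power-of-two page sizes */
    assert(os_page_size > 0 && (os_page_size & (os_page_size - 1)) == 0);
    return (size_t) (SKETCH_TYPEALIGN(os_page_size, allocated_size) /
                     os_page_size);
}
```

So the rounding concerns the length, not the start address: an entry that begins in the middle of an OS page could touch one more page than this count suggests.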
regards
--
Tomas Vondra
Attachments:
v25-0001-Add-support-for-basic-NUMA-awareness.patch (text/x-patch; charset=UTF-8)
From 381c5077592e38dbcbbf6acc4f1e86a767a92957 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v25 1/5] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 442 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 11615d1122d..e27badd83c3 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index debdf165044..d365a486d3d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0224f93733d..9ab070adffb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..8ebf0b03ec0 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 454ed81f5ea..46e92daeb62 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3243,6 +3264,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3899,6 +3921,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5d5be8ba4e1..dfc59ea0cc8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8542,6 +8542,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c2f1241b234..b3166ec8f42 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -686,6 +686,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..3c1b50c1428
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux before calling pg_numa_query_pages(): a page must
+ * be faulted in first, otherwise move_pages(2) reports it as not present.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 51041e75609..228888b2f66 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use the move_pages(2) syscall here, rather than get_mempolicy(2), because
+ * it lets us query many memory pages in a single system call, which is much
+ * faster than issuing one call per page.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.49.0
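Note (not part of the patch): the status array filled in through pg_numa_query_pages() follows the move_pages(2) convention, where a non-negative entry is a NUMA node id and a negative entry is a negated errno value, such as -ENOENT for a page that has not been faulted in. A minimal standalone sketch of how a caller might interpret such an array is shown below. The helper names numa_status_is_node(), numa_status_not_present() and count_pages_on_node() are hypothetical and do not exist in the patch or in libnuma.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if a status entry holds a NUMA node id. */
bool
numa_status_is_node(int status)
{
	return status >= 0;
}

/* Hypothetical helper: true if the page was not present (never touched). */
bool
numa_status_not_present(int status)
{
	return status == -ENOENT;
}

/* Hypothetical helper: count the entries that report the given node. */
int
count_pages_on_node(const int *status, int count, int node)
{
	int			n = 0;

	assert(status != NULL || count == 0);

	for (int i = 0; i < count; i++)
	{
		if (numa_status_is_node(status[i]) && status[i] == node)
			n++;
	}

	return n;
}
```

Grouping in C along these lines is only an illustration of the status convention; the patch itself returns one row per queried page and leaves any aggregation to SQL.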
v25-0002-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
From 0dde27480440e58c045341902051771d2b11e8e5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 20:21:25 +0200
Subject: [PATCH v25 2/5] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 5 +-
.../expected/pg_buffercache_numa.out | 28 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 22 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 276 ++++++++++++++++++
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 75 ++++-
9 files changed, 429 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
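Note (not part of the patch): the relationship between the database block size and the OS memory page size described in the commit message can be illustrated with a small standalone sketch. The helper names pages_queried_per_buffer() and os_page_addr() are hypothetical and do not appear in the patch. The sketch assumes both sizes are powers of two, and it rounds each address down to the start of the containing page, which is an assumption of this example rather than a statement about the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of OS page addresses to query for one buffer.
 * At least one address is queried, even when several buffers share a page.
 */
size_t
pages_queried_per_buffer(size_t blcksz, size_t os_page_size)
{
	/* Both sizes are assumed to be powers of two, so one divides the other. */
	assert(blcksz > 0 && os_page_size > 0);
	assert(blcksz % os_page_size == 0 || os_page_size % blcksz == 0);

	return (blcksz >= os_page_size) ? blcksz / os_page_size : 1;
}

/*
 * Hypothetical helper: start address of the j-th OS page overlapping a
 * buffer that begins at buf_addr.
 */
uintptr_t
os_page_addr(uintptr_t buf_addr, size_t os_page_size, size_t j)
{
	uintptr_t	addr = buf_addr + (uintptr_t) os_page_size * j;

	/* Round down to the page boundary (os_page_size is a power of two). */
	return addr & ~((uintptr_t) os_page_size - 1);
}
```

With 8kB blocks and 4kB pages each buffer yields two addresses; with 2MB huge pages many buffers resolve to the same page address, so the same page is simply queried repeatedly.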
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..1230e244a5f
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, page_num int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..a0e4cd69aee 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -20,6 +21,8 @@
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
+#define NUM_BUFFERCACHE_NUMA_ELEM 3
+
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
.version = PG_VERSION
@@ -59,15 +62,41 @@ typedef struct
} BufferCachePagesContext;
+typedef struct
+{
+ uint32 bufferid;
+ int32 numa_page;
+ int32 numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+ TupleDesc tupdesc;
+ int buffers_per_page;
+ int pages_per_buffer;
+ int os_page_size;
+ BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
+
/*
* Function returning data from the shared buffer cache - buffer number,
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -246,6 +275,253 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * Inquire about NUMA memory mappings, especially the NUMA node.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ BufferCacheNumaContext *fctx; /* User function context. */
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ j,
+ idx;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_page_status;
+ uint64 os_page_count;
+ int pages_per_buffer;
+ int buffers_per_page;
+ volatile uint64 touch pg_attribute_unused();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ /*
+ * Different database block sizes (1kB, 2kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ buffers_per_page = os_page_size / BLCKSZ;
+ pages_per_buffer = BLCKSZ / os_page_size;
+
+ /*
+ * The page size and the block size are both expected to be powers of
+ * two, so one divides the other (we don't know in which direction).
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * Either both counts are 1 (when the pages have the same size), or
+ * exactly one of them is zero. Both can't be zero at the same time.
+ */
+ Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
+ Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
+ ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+
+ /*
+ * How many addresses we are going to query (store) depends on the
+ * relation between BLCKSZ : PAGESIZE. We need at least one status per
+ * buffer - if the memory page is larger than buffer, we still query
+ * it for each buffer. With multiple memory pages per buffer, we need
+ * that many entries.
+ */
+ os_page_count = NBuffers * Max(1, pages_per_buffer);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_count, os_page_size, buffers_per_page, pages_per_buffer);
+
+
+ /* initialize the multi-call context */
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+ /*
+ * Get the result type from the call context rather than trusting
+ * the function definition: the SQL-level definition may be outdated
+ * (or simply wrong), and building tuples for a mismatched row type
+ * could crash. This function has no older column layouts to
+ * support, so we only verify that the expected number of output
+ * columns is present.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "page_num",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "node_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate one BufferCacheNumaRec record per queried OS page. */
+ fctx->record = (BufferCacheNumaRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCacheNumaRec) * os_page_count);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+
+ /* determine the NUMA node for OS pages */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffer-to-OS-page mapping code here.
+ */
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ /*
+ * If we have multiple OS pages per buffer, fill those in too. We
+ * always want at least one OS page, even if there are multiple
+ * buffers per page.
+ *
+ * Although we could query just once per each OS page, we do it
+ * repeatedly for each Buffer and hit the same address as
+ * move_pages(2) requires page alignment. This also simplifies
+ * retrieval code later on. Also, buffer IDs start from 1.
+ */
+ for (j = 0; j < Max(1, pages_per_buffer); j++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].numa_page = j;
+
+ os_page_ptrs[idx]
+ = (char *) TYPEALIGN_DOWN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+
+ ++idx;
+ }
+
+ }
+
+ /* we should get exactly the expected number of entries */
+ Assert(idx == os_page_count);
+
+ /* query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /*
+ * Update the entries with NUMA node ID. The status array is indexed
+ * the same way as the entry index.
+ */
+ for (i = 0; i < os_page_count; i++)
+ {
+ fctx->record[i].numa_node = os_page_status[i];
+ }
+
+ /* remember this backend touched the pages */
+ firstNumaTouch = false;
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+ Datum values[NUM_BUFFERCACHE_NUMA_ELEM];
+ bool nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(fctx->record[i].numa_page);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(fctx->record[i].numa_node);
+ nulls[2] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..3d9032efafb 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,15 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+ information is not part of <function>pg_buffercache_pages()</function>
+ itself, as it is much slower to retrieve.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +211,68 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>page_num</structfield> <type>int</type>
+ </para>
+ <para>
+ number of OS memory page for this buffer
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int</type>
+ </para>
+ <para>
+ ID of NUMA node
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all the cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
--
2.49.0
v25-0003-review.patch (text/x-patch; charset=UTF-8)
From 9b7de94cd2dacc40ffd0b4dfc992678c7eee41c2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Fri, 4 Apr 2025 21:10:11 +0200
Subject: [PATCH v25 3/5] review
---
contrib/pg_buffercache/pg_buffercache_pages.c | 28 +++++++++++++------
doc/src/sgml/pgbuffercache.sgml | 2 +-
src/tools/pgindent/typedefs.list | 2 ++
3 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index a0e4cd69aee..65ade9d8135 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -61,13 +61,15 @@ typedef struct
BufferCachePagesRec *record;
} BufferCachePagesContext;
-
+/*
+ * Record structure holding the to be exposed cache data.
+ */
typedef struct
{
uint32 bufferid;
int32 numa_page;
int32 numa_node;
-} BufferCacheNumaRec;
+} BufferCacheNumaRec;
/*
* Function context for data persisting over repeated calls.
@@ -79,7 +81,7 @@ typedef struct
int pages_per_buffer;
int os_page_size;
BufferCacheNumaRec *record;
-} BufferCacheNumaContext;
+} BufferCacheNumaContext;
/*
@@ -276,7 +278,15 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
}
/*
- * Inquire about NUMA memory mappings, especially the NUMA node.
+ * Inquire about NUMA memory mappings for shared buffers.
+ *
+ * Returns NUMA node ID for each memory page used by the buffer. Buffers may
+ * be smaller or larger than OS memory pages. For each buffer we return one
+ * entry for each memory page used by the buffer (if the buffer is smaller,
+ * it only uses a part of one memory page).
+ *
+ * We expect both sizes (for buffers and memory pages) to be a power-of-2, so
+ * one is always a multiple of the other.
*
* In order to get reliable results we also need to touch memory pages, so
* that the inquiry about NUMA memory node doesn't return -2 (which indicates
@@ -348,11 +358,13 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
*/
os_page_count = NBuffers * Max(1, pages_per_buffer);
- elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
- NBuffers, os_page_count, os_page_size, buffers_per_page, pages_per_buffer);
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " "
+ "os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_count, os_page_size,
+ buffers_per_page, pages_per_buffer);
- /* initialize the multi-call context */
+ /* initialize the multi-call context, load entries about buffers */
funcctx = SRF_FIRSTCALL_INIT();
@@ -400,7 +412,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
MemoryContextSwitchTo(oldcontext);
- /* determine the NUMA node for OS pages */
+ /* used to determine the NUMA node for all OS pages at once */
os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
os_page_status = palloc(sizeof(uint64) * os_page_count);
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 3d9032efafb..b01f8e71357 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -256,7 +256,7 @@
<structfield>node_id</structfield> <type>int</type>
</para>
<para>
- ID of NUMA node
+ ID of <acronym>NUMA</acronym> node
</para></entry>
</row>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 229fbff47ae..714cee6d6f1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -340,6 +340,8 @@ BufFile
Buffer
BufferAccessStrategy
BufferAccessStrategyType
+BufferCacheNumaRec
+BufferCacheNumaContext
BufferCachePagesContext
BufferCachePagesRec
BufferDesc
--
2.49.0
v25-0004-Introduce-pg_shmem_allocations_numa-view.patch (text/x-patch; charset=UTF-8)
From 65f4d373afc7166de769035bcb4333fdcc78707d Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v25 4/5] Introduce pg_shmem_allocations_numa view
Introduce new pg_shmem_allocations_numa view with information about how
shared memory is distributed across NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 132 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 +++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 274 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..a83365ae24a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..852f2c7c453 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,132 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ if (pg_numa_init() == -1)
+ {
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+ return (Datum) 0;
+ }
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ /* Get number of OS aliged pages */
+ shm_ent_page_count = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ /*
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (which indicates unmapped/unallocated pages).
+ */
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+ /* Count number of NUMA nodes used for this shared memory entry */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s >= 0 && s <= max_nodes)
+ nodes[s]++;
+ else
+ elog(ERROR, "invalid NUMA node id outside of allowed range [0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index dfc59ea0cc8..a93075c675c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8546,6 +8546,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..668172f7d79
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 1fddb13b6ae..c25062c288f 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3219,8 +3219,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3242,6 +3242,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3261,6 +3267,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..abfdc97abc5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..034098783fb
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 85d7280f35f..f337aa67c13 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1947,8 +1947,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1958,12 +1958,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.49.0
v25-0005-review.patch (text/x-patch; charset=UTF-8)
From f48800fe6dbc86a94df8f97f2bbde6bc68f74639 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Fri, 4 Apr 2025 20:45:21 +0200
Subject: [PATCH v25 5/5] review
---
src/backend/storage/ipc/shmem.c | 54 ++++++++++++++++++++++-----------
1 file changed, 37 insertions(+), 17 deletions(-)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 852f2c7c453..5d979423bd9 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -590,13 +590,11 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
max_nodes;
Size *nodes;
- InitMaterializedSRF(fcinfo, 0);
-
if (pg_numa_init() == -1)
- {
elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
- return (Datum) 0;
- }
+
+ InitMaterializedSRF(fcinfo, 0);
+
max_nodes = pg_numa_get_max_node();
nodes = palloc(sizeof(Size) * (max_nodes + 1));
@@ -619,6 +617,9 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
* memory size. This simplified approach allocates enough space for all
* pages in shared memory rather than calculating the exact requirements
* for each segment.
+ *
+ * XXX Isn't this wasteful? But there probably is one large segment of
+ * shared memory, much larger than the rest anyway.
*/
shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
@@ -637,8 +638,12 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
{
int i;
- /* Get number of OS aliged pages */
- shm_ent_page_count = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+ /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
+ * It's a bit misleading to call that "aligned", no? */
+
+ /* Get number of OS aligned pages */
+ shm_ent_page_count
+ = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
/*
* If we get ever 0xff back from kernel inquiry, then we probably have
@@ -646,16 +651,20 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
*/
memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+ /*
+ * Setup page_ptrs[] with pointers to all OS pages for this segment,
+ * and get the NUMA status using pg_numa_query_pages.
+ *
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (ENOENT, which indicates unmapped/unallocated pages).
+ */
for (i = 0; i < shm_ent_page_count; i++)
{
- /*
- * In order to get reliable results we also need to touch memory
- * pages, so that inquiry about NUMA memory node doesn't return -2
- * (which indicates unmapped/unallocated pages).
- */
volatile uint64 touch pg_attribute_unused();
page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+
if (firstNumaTouch)
pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
@@ -665,19 +674,27 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
elog(ERROR, "failed NUMA pages inquiry status: %m");
- memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
/* Count number of NUMA nodes used for this shared memory entry */
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+
for (i = 0; i < shm_ent_page_count; i++)
{
int s = pages_status[i];
/* Ensure we are adding only valid index to the array */
- if (s >= 0 && s <= max_nodes)
- nodes[s]++;
- else
- elog(ERROR, "invalid NUMA node id outside of allowed range [0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ if (s < 0 || s > max_nodes)
+ {
+ elog(ERROR, "invalid NUMA node id outside of allowed range "
+ "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ nodes[s]++;
}
+ /*
+ * Add one entry for each NUMA node, including those without allocated
+ * memory for this segment.
+ */
for (i = 0; i <= max_nodes; i++)
{
values[0] = CStringGetTextDatum(ent->key);
@@ -693,6 +710,9 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
* We are ignoring the following memory regions (as compared to
* pg_get_shmem_allocations()): 1. output shared memory allocated but not
* counted via the shmem index 2. output as-of-yet unused shared memory.
+ *
+ * XXX Not quite sure why this is at the end, and what "output memory"
+ * refers to.
*/
LWLockRelease(ShmemIndexLock);
--
2.49.0
Hi,
On Fri, Apr 04, 2025 at 09:25:57PM +0200, Tomas Vondra wrote:
OK,
here's v25 after going through the patches once more, fixing the issues
mentioned by Bertrand, etc.
Thanks!
I think 0001 and 0002 are fine,
Agree, I just have some cosmetic nit comments: please find them in
nit-bertrand-0002.txt attached.
I have a
couple minor questions about 0003.
- I was wondering if maybe we should have some "global ID" of memory
page, so that with large memory pages it's indicated the buffers are on
the same memory page. Right now each buffer starts page_num from 0, but
it should not be very hard to have a global counter. Opinions?
I think that's a good idea. We could then add a new column (say os_page_id) that
would help identify which buffers are sharing the same "physical" page.
0003
----
- Minor formatting tweaks, comment improvements.
- Isn't this comment a bit confusing / misleading?
/* Get number of OS aligned pages */
AFAICS the point is to adjust the allocated_size to be a multiple of
os_page_size, to get "all" memory pages the segment uses. But that's not
what I understand by "aligned page" (which is about where the page is
expected to start).
Agree, what about:
"
Align the start of the allocated size to an OS page size boundary and then get
the total number of OS pages used by this segment
"
- There's a comment at the end which talks about "ignored segments".
IMHO that type of information should be in the function comment,
but I'm
also not quite sure I understand what "output shared memory" is ...
I think that comes from the comments that are already in
pg_get_shmem_allocations().
I think they are located here and worded that way to make it easier to see
what pg_get_shmem_allocations_numa() leaves out when one looks at both
functions. That said, I'm +1 on putting this kind of comment in the function comment.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
nit-bertrand-0002.txt (text/plain; charset=us-ascii)
commit 17942e95d99d288658b6530e49795c28a93886d2
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Sat Apr 5 08:56:03 2025 +0000
Nit
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 65ade9d8135..0b96476c319 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -364,7 +364,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
buffers_per_page, pages_per_buffer);
- /* initialize the multi-call context, load entries about buffers */
+ /* Initialize the multi-call context, load entries about buffers */
funcctx = SRF_FIRSTCALL_INIT();
@@ -412,7 +412,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
MemoryContextSwitchTo(oldcontext);
- /* used to determine the NUMA node for all OS pages at once */
+ /* Used to determine the NUMA node for all OS pages at once */
os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
os_page_status = palloc(sizeof(uint64) * os_page_count);
@@ -484,10 +484,10 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
}
- /* we should get exactly the expected number of entrires */
+ /* We should get exactly the expected number of entries */
Assert(idx == os_page_count);
- /* query NUMA status for all the pointers */
+ /* Query NUMA status for all the pointers */
if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
elog(ERROR, "failed NUMA pages inquiry: %m");
@@ -500,7 +500,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
fctx->record[i].numa_node = os_page_status[i];
}
- /* remember this backend touched the pages */
+ /* Remember this backend touched the pages */
firstNumaTouch = false;
}
On 4/5/25 11:37, Bertrand Drouvot wrote:
Hi,
On Fri, Apr 04, 2025 at 09:25:57PM +0200, Tomas Vondra wrote:
OK,
here's v25 after going through the patches once more, fixing the issues
mentioned by Bertrand, etc.
Thanks!
I think 0001 and 0002 are fine,
Agree, I just have some cosmetic nit comments: please find them in
nit-bertrand-0002.txt attached.
I have a
couple minor questions about 0003.
- I was wondering if maybe we should have some "global ID" of memory
page, so that with large memory pages it's indicated the buffers are on
the same memory page. Right now each buffer starts page_num from 0, but
it should not be very hard to have a global counter. Opinions?
I think that's a good idea. We could then add a new column (say os_page_id) that
would help identify which buffers are sharing the same "physical" page.
I was thinking we'd change the definition of the existing page_num
column, i.e. it wouldn't be 0..N sequence for each buffer, but a global
page ID. But I don't know if this would be useful in practice.
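For illustration, a minimal sketch of how such a global page ID could be
derived, assuming the start address of the buffer area and the OS page size
are known. The helper name global_os_page_id() and its parameters are
hypothetical, they are not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): map an address inside the
 * shared buffer area to a global OS page number, counted from the OS page
 * that contains the start of the buffer area.
 */
static int64_t
global_os_page_id(uintptr_t buffers_start, uintptr_t addr, size_t os_page_size)
{
	uintptr_t	first_page;

	/* OS page sizes are powers of two, so masking rounds down to a page */
	assert(os_page_size > 0 && (os_page_size & (os_page_size - 1)) == 0);
	assert(addr >= buffers_start);

	/* start of the OS page holding the first byte of the buffer area */
	first_page = buffers_start & ~((uintptr_t) os_page_size - 1);

	return (int64_t) ((addr - first_page) / os_page_size);
}
```

With 8kB blocks on 2MB huge pages, and assuming the buffer area starts on a
huge page boundary, the first 256 buffers would all report ID 0, which is the
sharing that a per-buffer 0..N numbering cannot show.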
0003
----
- Minor formatting tweaks, comment improvements.
- Isn't this comment a bit confusing / misleading?
/* Get number of OS aligned pages */
AFAICS the point is to adjust the allocated_size to be a multiple of
os_page_size, to get "all" memory pages the segment uses. But that's not
what I understand by "aligned page" (which is about where the page is
expected to start).
Agree, what about?
"
Align the start of the allocated size to an OS page size boundary and then get
the total number of OS pages used by this segment"
"
Something like that. But I think it should be "align the size of ...",
we're not aligning the start.
- There's a comment at the end which talks about "ignored segments".
IMHO that type of information should be in the function comment,
but I'm
also not quite sure I understand what "output shared memory" is ...
I think that comes from the comments that are already in
pg_get_shmem_allocations().
I think that those are located here and worded that way to make it easier to understand
what is not included in pg_get_shmem_allocations_numa() if one looks at both
functions. That said, I'm +1 to put this kind of comments in the function comment.
OK. But I'm still not sure what "output shared memory" is about. Can you
explain what shmem segments are not included?
regards
--
Tomas Vondra
On 4/5/25 15:23, Tomas Vondra wrote:
On 4/5/25 11:37, Bertrand Drouvot wrote:
Hi,
On Fri, Apr 04, 2025 at 09:25:57PM +0200, Tomas Vondra wrote:
OK,
here's v25 after going through the patches once more, fixing the issues
mentioned by Bertrand, etc.
Thanks!
I think 0001 and 0002 are fine,
Agree, I just have some cosmetic nits comments: please find them in
nit-bertrand-0002.txt attached.
I have a
couple minor questions about 0003.
- I was wondering if maybe we should have some "global ID" of memory
page, so that with large memory pages it's indicated the buffers are on
the same memory page. Right now each buffer starts page_num from 0, but
it should not be very hard to have a global counter. Opinions?
I think that's a good idea. We could then add a new column (say os_page_id) that
would help identify which buffers are sharing the same "physical" page.
I was thinking we'd change the definition of the existing page_num
column, i.e. it wouldn't be 0..N sequence for each buffer, but a global
page ID. But I don't know if this would be useful in practice.
See the attached v25 with a draft of this in patch 0003.
While working on this, I realized it's probably wrong to use TYPEALIGN()
to calculate the OS page pointer. The code did this:
os_page_ptrs[idx]
= (char *) TYPEALIGN(os_page_size,
buffptr + (os_page_size * j));
but TYPEALIGN() rounds "up". Let's assume we have 1KB buffers and 4KB
memory pages, and that the first buffer is aligned to 4kB (i.e. it
starts right at the beginning of a memory page). Then we expect to get
page_num sequence:
0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, ...
with 4 buffers per memory page. But we get this:
0, 1, 1, 1, 1, 2, 2, 2, 2, ...
So I've changed this to TYPEALIGN_DOWN(), which fixes the result.
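The difference can be reproduced outside the server with a small standalone sketch. The two macros below are modelled on the definitions in PostgreSQL's c.h; the helper names page_num_up() and page_num_down() are made up for this example and do not exist in the patch. As in the example above, the first buffer is assumed to start exactly on a memory page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Modelled on c.h; ALIGNVAL must be a power of two */
#define TYPEALIGN(ALIGNVAL,LEN)  \
	(((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))
#define TYPEALIGN_DOWN(ALIGNVAL,LEN)  \
	(((uintptr_t) (LEN)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

/* page_num of buffer i when the buffer address is rounded up (old code) */
static int
page_num_up(uintptr_t base, int i, size_t blcksz, size_t os_page_size)
{
	assert(base % os_page_size == 0);	/* first buffer is page-aligned */
	return (int) ((TYPEALIGN(os_page_size, base + i * blcksz) - base)
				  / os_page_size);
}

/* page_num of buffer i when the buffer address is rounded down (new code) */
static int
page_num_down(uintptr_t base, int i, size_t blcksz, size_t os_page_size)
{
	assert(base % os_page_size == 0);	/* first buffer is page-aligned */
	return (int) ((TYPEALIGN_DOWN(os_page_size, base + i * blcksz) - base)
				  / os_page_size);
}
```

For 1KB buffers and 4KB pages, page_num_up() jumps to page 1 already for the second buffer, while page_num_down() keeps the first four buffers on page 0.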
The pg_shmem_allocations_numa had a variant of this issue when
calculating the number of pages, I think. I believe the shmem segment
may start half-way through a page, and the allocated_size may not be a
multiple of os_page_size (otherwise why use TYPEALIGN). In which case we
might skip the first page, I think.
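One way to count the pages so that a segment starting half-way through a page is fully covered is sketched below. This is only an illustration of the arithmetic, not the code in the patch: the function name os_pages_spanned() is hypothetical, and it assumes a non-empty segment and a power-of-two page size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of OS pages touched by the byte range
 * [start, start + size). Rounds the start down and the end up, instead of
 * only rounding the size up.
 */
static size_t
os_pages_spanned(uintptr_t start, size_t size, size_t os_page_size)
{
	uintptr_t	mask = ~((uintptr_t) os_page_size - 1);
	uintptr_t	first = start & mask;	/* page containing the first byte */
	uintptr_t	end = (start + size + os_page_size - 1) & mask;	/* rounded up */

	assert((os_page_size & (os_page_size - 1)) == 0);
	assert(size > 0);

	return (size_t) ((end - first) / os_page_size);
}
```

A 4096-byte segment that starts 2048 bytes into a 4096-byte page touches two pages, whereas rounding only the size up to a multiple of the page size would report one.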
The one question I have about this is whether we know the pointer
returned by TYPEALIGN_DOWN() is valid. It's before ent->location (or
before the first shared buffer) ...
regards
--
Tomas Vondra
Attachments:
v25-0001-Add-support-for-basic-NUMA-awareness.patch (text/x-patch; charset=UTF-8)
From 759e12ab166f1af975fb2af5a9c1adaed8f8490b Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v25 1/5] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 442 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 11615d1122d..e27badd83c3 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index debdf165044..d365a486d3d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0224f93733d..9ab070adffb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..8ebf0b03ec0 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 454ed81f5ea..46e92daeb62 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3243,6 +3264,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3899,6 +3921,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5d5be8ba4e1..dfc59ea0cc8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8542,6 +8542,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c2f1241b234..b3166ec8f42 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -686,6 +686,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..3c1b50c1428
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 51041e75609..228888b2f66 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use move_pages(2) syscall here - instead of get_mempolicy(2) - as the
+ * first one allows us to batch and query about many memory pages in one single
+ * giant system call that is way faster.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.49.0
v25-0002-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch (text/x-patch; charset=UTF-8)
From c6862620829f958feb71b987aa34c528149d64ad Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 20:21:25 +0200
Subject: [PATCH v25 2/5] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 5 +-
.../expected/pg_buffercache_numa.out | 28 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 22 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 288 ++++++++++++++++++
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 75 ++++-
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 443 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..1230e244a5f
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, page_num int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..0b96476c319 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -20,6 +21,8 @@
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
+#define NUM_BUFFERCACHE_NUMA_ELEM 3
+
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
.version = PG_VERSION
@@ -58,16 +61,44 @@ typedef struct
BufferCachePagesRec *record;
} BufferCachePagesContext;
+/*
+ * Record structure holding the to be exposed cache data.
+ */
+typedef struct
+{
+ uint32 bufferid;
+ int32 numa_page;
+ int32 numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+ TupleDesc tupdesc;
+ int buffers_per_page;
+ int pages_per_buffer;
+ int os_page_size;
+ BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
/*
* Function returning data from the shared buffer cache - buffer number,
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -246,6 +277,263 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * Inquire about NUMA memory mappings for shared buffers.
+ *
+ * Returns NUMA node ID for each memory page used by the buffer. Buffers may
+ * be smaller or larger than OS memory pages. For each buffer we return one
+ * entry for each memory page used by the buffer (if the buffer is smaller,
+ * it only uses a part of one memory page).
+ *
+ * We expect both sizes (for buffers and memory pages) to be a power-of-2, so
+ * one is always a multiple of the other.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ BufferCacheNumaContext *fctx; /* User function context. */
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ j,
+ idx;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_page_status;
+ uint64 os_page_count;
+ int pages_per_buffer;
+ int buffers_per_page;
+ volatile uint64 touch pg_attribute_unused();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ buffers_per_page = os_page_size / BLCKSZ;
+ pages_per_buffer = BLCKSZ / os_page_size;
+
+ /*
+ * The page size and block size are both expected to be 2^k, so one
+ * divides the other (we don't know in which direction).
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * Either both counts are 1 (buffer and OS page sizes are equal), or
+ * exactly one of them is zero. Both can't be zero at the same time.
+ */
+ Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
+ Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
+ ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+
+ /*
+ * How many addresses we are going to query (store) depends on the
+ * relation between BLCKSZ : PAGESIZE. We need at least one status per
+ * buffer - if the memory page is larger than buffer, we still query
+ * it for each buffer. With multiple memory pages per buffer, we need
+ * that many entries.
+ */
+ os_page_count = NBuffers * Max(1, pages_per_buffer);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " "
+ "os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_count, os_page_size,
+ buffers_per_page, pages_per_buffer);
+
+
+ /* Initialize the multi-call context, load entries about buffers */
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+ /*
+ * Look up the result type expected by the caller rather than relying
+ * on the type implied by the function definition. That way a stale
+ * (or simply wrong) SQL-level definition produces an error below
+ * instead of a crash. Unlike pg_buffercache_pages(), there are no
+ * optional columns to handle here, so the number of attributes has
+ * to match exactly.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "page_num",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "node_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate os_page_count worth of BufferCacheNumaRec records. */
+ fctx->record = (BufferCacheNumaRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCacheNumaRec) * os_page_count);
+
+ /* Set max calls (one per record) and remember the function context. */
+ funcctx->max_calls = os_page_count;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+
+ /* Used to determine the NUMA node for all OS pages at once */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffer to OS page mapping code here.
+ */
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ /*
+ * If we have multiple OS pages per buffer, fill those in too. We
+ * always want at least one OS page, even if there are multiple
+ * buffers per page.
+ *
+ * Although we could query just once per OS page, we do it
+ * repeatedly for each buffer and hit the same address, as
+ * move_pages(2) requires page alignment. This also simplifies the
+ * retrieval code later on. Also note that buffer IDs start at 1.
+ */
+ for (j = 0; j < Max(1, pages_per_buffer); j++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].numa_page = j;
+
+ os_page_ptrs[idx]
+ = (char *) TYPEALIGN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+
+ ++idx;
+ }
+
+ }
+
+ /* We should get exactly the expected number of entries */
+ Assert(idx == os_page_count);
+
+ /* Query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /*
+ * Update the entries with NUMA node ID. The status array is indexed
+ * the same way as the entry index.
+ */
+ for (i = 0; i < os_page_count; i++)
+ {
+ fctx->record[i].numa_node = os_page_status[i];
+ }
+
+ /* Remember this backend touched the pages */
+ firstNumaTouch = false;
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+ Datum values[NUM_BUFFERCACHE_NUMA_ELEM];
+ bool nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(fctx->record[i].numa_page);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(fctx->record[i].numa_node);
+ nulls[2] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..b01f8e71357 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,15 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+ information is not part of <function>pg_buffercache_pages()</function>
+ itself, as it is much slower to retrieve.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +211,68 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>page_num</structfield> <type>int</type>
+ </para>
+ <para>
+ Number of the OS memory page for this buffer
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Because the <acronym>NUMA</acronym> node ID inquiry requires each memory page
+ to be paged in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly, and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0c81d03950d..ed74a76a5c7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -341,6 +341,8 @@ BufFile
Buffer
BufferAccessStrategy
BufferAccessStrategyType
+BufferCacheNumaRec
+BufferCacheNumaContext
BufferCachePagesContext
BufferCachePagesRec
BufferDesc
--
2.49.0
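
For readers following the buffer/page arithmetic in pg_buffercache_numa_pages() above, here is a minimal standalone sketch of it in plain C. It is only an illustration under stated assumptions: NumaPageMapping and numa_page_mapping() are hypothetical names that do not exist in the patch, and the sizes are passed in as parameters instead of coming from BLCKSZ and pg_numa_get_pagesize().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type for illustration; not part of the patch. */
typedef struct NumaPageMapping
{
	int			buffers_per_page;	/* > 1 if OS pages are larger than buffers */
	int			pages_per_buffer;	/* > 1 if buffers are larger than OS pages */
	uint64_t	os_page_count;		/* number of addresses to query */
} NumaPageMapping;

static int
is_power_of_two(size_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Compute how many page addresses have to be queried for nbuffers buffers
 * of block_size bytes when OS memory pages are os_page_size bytes.
 */
static NumaPageMapping
numa_page_mapping(int nbuffers, size_t block_size, size_t os_page_size)
{
	NumaPageMapping m;

	/* Both sizes are assumed to be powers of two, so one divides the other. */
	assert(is_power_of_two(block_size) && is_power_of_two(os_page_size));

	/* Integer division: unless the sizes are equal, exactly one is zero. */
	m.buffers_per_page = (int) (os_page_size / block_size);
	m.pages_per_buffer = (int) (block_size / os_page_size);

	/* At least one status per buffer, more if a buffer spans several pages. */
	m.os_page_count = (uint64_t) nbuffers *
		(uint64_t) (m.pages_per_buffer > 1 ? m.pages_per_buffer : 1);

	return m;
}
```

With 16384 buffers of 8kB and 4kB OS pages this yields 32768 addresses, which agrees with the os_page_count value in the NOTICE output of the original draft. With 2MB huge pages the count drops to one address per buffer, and many buffers resolve to the same OS page.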
Attachment: v25-0003-adjust-page_num.patch (text/x-patch; charset=UTF-8)
From 402feee87f2a14f670ad766c95c7773c7cf712d7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 5 Apr 2025 16:00:39 +0200
Subject: [PATCH v25 3/5] adjust page_num
---
contrib/pg_buffercache/pg_buffercache_pages.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 0b96476c319..a3c4a2578d9 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -315,6 +315,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
int pages_per_buffer;
int buffers_per_page;
volatile uint64 touch pg_attribute_unused();
+ char *startptr = NULL;
if (pg_numa_init() == -1)
elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
@@ -437,6 +438,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
* to one big move_pages(2) inquiry system call. Basically we ask
* for all memory pages for NBuffers.
*/
+ startptr = (char *) BufferGetBlock(1);
idx = 0;
for (i = 0; i < NBuffers; i++)
{
@@ -469,11 +471,14 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
char *buffptr = (char *) BufferGetBlock(i + 1);
fctx->record[idx].bufferid = bufferid;
- fctx->record[idx].numa_page = j;
os_page_ptrs[idx]
- = (char *) TYPEALIGN(os_page_size,
- buffptr + (os_page_size * j));
+ = (char *) TYPEALIGN_DOWN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* calculate ID of the OS memory page */
+ fctx->record[idx].numa_page
+ = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;
/* Only need to touch memory once per backend process lifetime */
if (firstNumaTouch)
--
2.49.0
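
To illustrate why the patch above switches from rounding up to rounding down, here is a small standalone sketch. It is not the patch's code: align_down(), align_up() and os_page_number() are hypothetical helpers written for this example, addresses are plain integers, and the sketch additionally aligns the base address down, which the patch itself does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of align (align must be a power of two). */
static uintptr_t
align_down(uintptr_t addr, uintptr_t align)
{
	return addr & ~(align - 1);
}

/* Round addr up to a multiple of align (align must be a power of two). */
static uintptr_t
align_up(uintptr_t addr, uintptr_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/*
 * Number of the OS page that holds the j-th page-sized chunk of a buffer,
 * counted from the OS page containing the base address (assumption: the
 * base is aligned down here, unlike startptr in the patch).
 */
static uint64_t
os_page_number(uintptr_t base, uintptr_t buffer, int j, uintptr_t os_page_size)
{
	uintptr_t	page = align_down(buffer + os_page_size * (uintptr_t) j,
								  os_page_size);

	return (uint64_t) ((page - align_down(base, os_page_size)) / os_page_size);
}
```

Rounding up is harmless while buffers are at least as large as OS pages and start on a page boundary, but with huge pages a buffer that starts in the middle of a page would be attributed to the following page. For example, with a 2MB page and the second 8kB buffer, align_up() points at the next huge page while align_down() stays on the page that actually contains the buffer.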
Attachment: v25-0004-Introduce-pg_shmem_allocations_numa-view.patch (text/x-patch; charset=UTF-8)
From 03d24af540f8235ad9ca9537db0a1ba5dbcf6ccb Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v25 4/5] Introduce pg_shmem_allocations_numa view
Introduce new pg_shmem_allocations_numa view with information about how
shared memory is distributed across NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 152 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 ++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 294 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..a83365ae24a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..5d979423bd9 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,152 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Shared memory allocations are not necessarily aligned to (or sized in
+ * multiples of) OS memory pages, and the OS page size may differ between
+ * systems.
+ *
+ * To map allocations to OS pages, we first need to determine the OS
+ * memory page size. The number of OS pages covering each allocation is
+ * then calculated per shared memory index entry in the loop below.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * XXX Isn't this wasteful? But there probably is one large segment of
+ * shared memory, much larger than the rest anyway.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
+ * It's a bit misleading to call that "aligned", no? */
+
+ /* Get number of OS aligned pages */
+ shm_ent_page_count
+ = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our shared memory to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ /*
+ * Set up page_ptrs[] with pointers to all OS pages for this segment,
+ * and get the NUMA status using pg_numa_query_pages.
+ *
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (ENOENT, which indicates unmapped/unallocated pages).
+ */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ /* Count the pages on each NUMA node for this shared memory entry */
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we only use a valid index into the array */
+ if (s < 0 || s > max_nodes)
+ {
+ elog(ERROR, "invalid NUMA node id outside of allowed range "
+ "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ nodes[s]++;
+ }
+
+ /*
+ * Add one entry for each NUMA node, including those without allocated
+ * memory for this segment.
+ */
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ *
+ * XXX Not quite sure why this is at the end, and what "output memory"
+ * refers to.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index dfc59ea0cc8..a93075c675c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8546,6 +8546,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..668172f7d79
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 1fddb13b6ae..c25062c288f 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3219,8 +3219,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3242,6 +3242,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3261,6 +3267,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..abfdc97abc5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..034098783fb
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 85d7280f35f..f337aa67c13 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1947,8 +1947,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1958,12 +1958,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.49.0
Attachment: v25-0005-adjust-page-alignment.patch (text/x-patch)
From a3d50ee60f313f7c0d665dc03a188dec7f32a4e3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 5 Apr 2025 16:20:13 +0200
Subject: [PATCH v25 5/5] adjust page alignment
---
src/backend/storage/ipc/shmem.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 5d979423bd9..4a9a9606f2e 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -637,13 +637,22 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
{
int i;
+ char *startptr,
+ *endptr;
+ Size total_len;
- /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
- * It's a bit misleading to call that "aligned", no? */
+ /*
+ * Calculate the range of OS pages used by this segment. The segment
+ * may start / end half-way through a page, we want to count these
+ * pages too. So we align the start/end pointers down/up, and then
+ * calculate the number of pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, ent->location);
+ endptr = (char *) TYPEALIGN(os_page_size,
+ (char *) ent->location + ent->allocated_size);
+ total_len = (endptr - startptr);
- /* Get number of OS aligned pages */
- shm_ent_page_count
- = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+ shm_ent_page_count = total_len / os_page_size;
/*
* If we get ever 0xff back from kernel inquiry, then we probably have
@@ -663,7 +672,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
{
volatile uint64 touch pg_attribute_unused();
- page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ page_ptrs[i] = startptr + (i * os_page_size);
if (firstNumaTouch)
pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
--
2.49.0
Hi,
On Sat, Apr 05, 2025 at 04:33:28PM +0200, Tomas Vondra wrote:
On 4/5/25 15:23, Tomas Vondra wrote:
I was thinking we'd change the definition of the existing page_num
column, i.e. it wouldn't be 0..N sequence for each buffer, but a global
page ID. But I don't know if this would be useful in practice.
See the attached v25 with a draft of this in patch 0003.
I see, thanks for sharing. I think that's useful because that could help
identify which buffers share the same OS page.
While working on this, I realized it's probably wrong to use TYPEALIGN()
to calculate the OS page pointer. The code did this:
os_page_ptrs[idx]
    = (char *) TYPEALIGN(os_page_size,
                         buffptr + (os_page_size * j));
but TYPEALIGN() rounds "up". Let's assume we have 1KB buffers and 4KB
memory pages, and that the first buffer is aligned to 4kB (i.e. it
starts right at the beginning of a memory page). Then we expect to get
page_num sequence:
0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, ...
with 4 buffers per memory page. But we get this:
0, 1, 1, 1, 1, 2, 2, 2, 2, ...
So I've changed this to TYPEALIGN_DOWN(), which fixes the result.
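The difference between the two roundings can be reproduced outside the server. The following is only a minimal sketch: the macros mirror the rounding of TYPEALIGN() and TYPEALIGN_DOWN(), but the names ALIGN_UP, ALIGN_DOWN and buffer_page_num() are made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same rounding rules as TYPEALIGN() / TYPEALIGN_DOWN(); names are local. */
#define ALIGN_UP(a, p)   (((uintptr_t) (p) + ((a) - 1)) & ~((uintptr_t) ((a) - 1)))
#define ALIGN_DOWN(a, p) (((uintptr_t) (p)) & ~((uintptr_t) ((a) - 1)))

/*
 * OS page index of buffer number "buf" (0-based), relative to a page-aligned
 * base address. round_up selects the old rounding (nonzero) or the fixed
 * rounding (zero).
 */
static size_t
buffer_page_num(uintptr_t base, size_t buf_size, size_t page_size,
                size_t buf, int round_up)
{
    uintptr_t   addr = base + buf * buf_size;
    uintptr_t   page;

    /* alignment by masking only works for power-of-two page sizes */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    assert(base % page_size == 0);

    page = round_up ? ALIGN_UP(page_size, addr) : ALIGN_DOWN(page_size, addr);
    return (page - base) / page_size;
}
```

With 1KB buffers and 4KB pages, rounding up puts buffer 1 on page 1 (only buffer 0 stays on page 0), while rounding down keeps buffers 0..3 on page 0, which is the expected sequence.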
Good catch, that makes complete sense.
But now I can see some page_num < 0:
postgres=# select page_num,node_id,count(*) from pg_buffercache_numa group by page_num,node_id order by 1 limit 4;
page_num | node_id | count
----------+---------+-------
-1 | 0 | 386
0 | 1 | 1024
1 | 0 | 1024
2 | 1 | 1024
I think that can be solved this way:
- startptr = (char *) BufferGetBlock(1);
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));
so that startptr is aligned to the same boundaries. But I guess that we'll
have the same question as the following one:
The one question I have about this is whether we know the pointer
returned by TYPEALIGN_DOWN() is valid. It's before ent->location (or
before the first shared buffer) ...
Yeah, I'm not 100% sure about that... Maybe for safety we could use TYPEALIGN_DOWN()
for the reporting and use the actual buffer address when pg_numa_touch_mem_if_required()
is called? (to avoid touching "invalid" memory).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Sat, Apr 05, 2025 at 03:23:38PM +0200, Tomas Vondra wrote:
Something like that. But I think it should be "align the size of ...",
we're not aligning the start.
- There's a comment at the end which talks about "ignored segments".
IMHO that type of information should be in the function comment,
but I'm also not quite sure I understand what "output shared memory" is ...
I think that comes from the comments that are already in
pg_get_shmem_allocations().
I think that those are located here and worded that way to make it easier to
understand what is not covered by pg_get_shmem_allocations_numa() if one looks
at both functions. That said, I'm +1 to put this kind of comment in the
function comment.
OK. But I'm still not sure what "output shared memory" is about. Can you
explain what shmem segments are not included?
Looking at pg_get_shmem_allocations() and the pg_shmem_allocations view
documentation, I would say the wording is linked to "anonymous allocations" and
"unused memory" (i.e. the ones reported with <anonymous> or NULL as name in the
pg_shmem_allocations view). Using output in this comment sounds confusing while
it makes sense in pg_get_shmem_allocations() because it really reports those.
I think that we could just mention in the function comment that
pg_get_shmem_allocations_numa() does not handle anonymous allocations and unused
memory.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
I just played around with this for a bit. As noted somewhere further down,
pg_buffercache_numa.page_num ends up wonky in different ways for the different
pages.
I think one thing that the docs should mention is that calling the numa
functions/views will force the pages to be allocated, even if they're
currently unused.
Newly started server, with s_b of 32GB and 2MB huge pages:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 34448
HugePages_Rsvd: 16437
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
run
SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
Now the pages that previously were marked as reserved are actually allocated:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 18012
HugePages_Rsvd: 1
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
I don't see how we can avoid that right now, but at the very least we ought to
document it.
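To put the two /proc/meminfo snapshots into perspective, the amount of memory that got faulted in can be derived from the drop in HugePages_Free. This is just a small arithmetic sketch; the function name hugepages_faulted_kb() is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Memory (in kB) that moved from "free" to "allocated" between two
 * /proc/meminfo snapshots, given HugePages_Free before/after and
 * Hugepagesize in kB.
 */
static uint64_t
hugepages_faulted_kb(uint64_t free_before, uint64_t free_after,
                     uint64_t hugepage_kb)
{
    assert(free_before >= free_after);
    return (free_before - free_after) * hugepage_kb;
}
```

For the numbers above, 34448 - 18012 = 16436 huge pages of 2048 kB each, i.e. 33660928 kB or roughly 32.1 GiB. HugePages_Rsvd dropped by the same 16436 pages (16437 to 1), which is consistent with essentially all of the 32GB shared_buffers being touched by the query.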
On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote:
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
Hm? libnuma1:i386 installs just fine to me on debian and contains the shared
library.
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+  # via pkg-config
+  libnuma = dependency('numa', required: libnumaopt)
+  if not libnuma.found()
+    libnuma = cc.find_library('numa', required: libnumaopt)
+  endif
This fallback isn't going to work if -Dlibnuma=enabled is used, as
dependency() will error out, due to not finding a required dependency. You'd
need to use required: false there.
Do we actually need a non-dependency() fallback here? It's linux only and a
new dependency, so just requiring a .pc file seems like it'd be fine?
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+    ro_volatile_var = *(uint64 *) ptr
Does it really work that way? A volatile variable to assign the result of
dereferencing ptr ensures that the *store* isn't removed by the compiler, but
it doesn't at all guarantee that the *load* isn't removed, since that memory
isn't marked as volatile.
I think you'd need to cast the source pointer to a volatile uint64* to ensure
the load happens.
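A minimal sketch of what such a cast could look like follows. The names touch_mem() and touch_range() are hypothetical and not taken from the patch; the sketch also assumes that the start address is 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical variant of the macro: the volatile qualifier is on the
 * source pointer, so the compiler has to perform the load.
 */
#define touch_mem(ro_volatile_var, ptr) \
    ((ro_volatile_var) = *(volatile uint64_t *) (ptr))

/*
 * Touch the first 8 bytes of every page in [start, start + len) and return
 * the number of pages touched. Assumes start is 8-byte aligned.
 */
static size_t
touch_range(char *start, size_t len, size_t page_size)
{
    volatile uint64_t touch = 0;
    size_t      npages = 0;
    size_t      off;

    assert(page_size >= sizeof(uint64_t));

    for (off = 0; off + sizeof(uint64_t) <= len; off += page_size)
    {
        touch_mem(touch, start + off);
        npages++;
    }
    (void) touch;
    return npages;
}
```

Keeping the destination variable volatile as well does no harm, but it is the qualifier on the dereferenced pointer that forces the read.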
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+    SELECT P.* FROM pg_buffercache_numa_pages() AS P
+    (bufferid integer, page_num int4, node_id int4);
Why CREATE OR REPLACE?
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
...
+
+        /*
+         * To smoothly support upgrades from version 1.0 of this extension
+         * transparently handle the (non-)existence of the pinning_backends
+         * column. We unfortunately have to get the result type for that... -
+         * we can't use the result type determined by the function definition
+         * without potentially crashing when somebody uses the old (or even
+         * wrong) function definition though.
+         */
+        if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+            elog(ERROR, "return type must be a row type");
+
Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view?
+
+        if (firstNumaTouch)
+            elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
Over the patchseries the related code is duplicated. Seems like it'd be good
to put it into pg_numa instead? This seems like the thing that's good to
abstract away in one central spot.
+        /*
+         * Scan through all the buffers, saving the relevant fields in the
+         * fctx->record structure.
+         *
+         * We don't hold the partition locks, so we don't get a consistent
+         * snapshot across all buffers, but we do grab the buffer header
+         * locks, so the information of each buffer is self-consistent.
+         *
+         * This loop touches and stores addresses into os_page_ptrs[] as input
+         * to one big big move_pages(2) inquiry system call. Basically we ask
+         * for all memory pages for NBuffers.
+         */
+        idx = 0;
+        for (i = 0; i < NBuffers; i++)
+        {
+            BufferDesc *bufHdr;
+            uint32      buf_state;
+            uint32      bufferid;
+
+            CHECK_FOR_INTERRUPTS();
+
+            bufHdr = GetBufferDescriptor(i);
+
+            /* Lock each buffer header before inspecting. */
+            buf_state = LockBufHdr(bufHdr);
+            bufferid = BufferDescriptorGetBuffer(bufHdr);
+
+            UnlockBufHdr(bufHdr, buf_state);
Given that the only thing you're getting here is the buffer id, it's probably
not worth getting the spinlock, a single 4 byte read is always atomic on our
platforms.
+            /*
+             * If we have multiple OS pages per buffer, fill those in too. We
+             * always want at least one OS page, even if there are multiple
+             * buffers per page.
+             *
+             * Altough we could query just once per each OS page, we do it
+             * repeatably for each Buffer and hit the same address as
+             * move_pages(2) requires page aligment. This also simplifies
+             * retrieval code later on. Also NBuffers starts from 1.
+             */
+            for (j = 0; j < Max(1, pages_per_buffer); j++)
+            {
+                char       *buffptr = (char *) BufferGetBlock(i + 1);
+
+                fctx->record[idx].bufferid = bufferid;
+                fctx->record[idx].numa_page = j;
+
+                os_page_ptrs[idx]
+                    = (char *) TYPEALIGN(os_page_size,
+                                         buffptr + (os_page_size * j));
FWIW, this bit here is the most expensive part of the function itself, as the
compiler has no choice than to implement it as an actual division, as
os_page_size is runtime variable.
It'd be fine to leave it like that, the call to numa_move_pages() is way more
expensive. But it shouldn't be too hard to do that alignment once, rather than
having to do it over and over.
FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
works quite right for me.
This definition afaict is always 0 when using huge pages and just 0 and 1 for
4k pages. But my understanding of numa_page is that it's the "id" of the numa
pages, which isn't that?
With "adjust page_num" I get a number that starts at -1 and then increments
from there. More correct, but doesn't quite seem right either.
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>bufferid</structfield> <type>integer</type>
+      </para>
+      <para>
+       ID, in the range 1..<varname>shared_buffers</varname>
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>page_num</structfield> <type>int</type>
+      </para>
+      <para>
+       number of OS memory page for this buffer
+      </para></entry>
+     </row>
"page_num" doesn't really seem very descriptive for "number of OS memory page
for this buffer". To me "number of" sounds like it's counting the number of
associated pages, but it's really just a "page id" or something like that.
Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
From 03d24af540f8235ad9ca9537db0a1ba5dbcf6ccb Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v25 4/5] Introduce pg_shmem_allocations_numa view
Introduce new pg_shmem_alloctions_numa view with information about how
shared memory is distributed across NUMA nodes.
Nice.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: /messages/by-id/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N=q1w+DiH-696Xw@mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 152 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 ++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 294 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..a83365ae24a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
      <entry>shared memory allocations</entry>
     </row>
+    <row>
+     <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+     <entry>NUMA node mappings for shared memory allocations</entry>
+    </row>
+
     <row>
      <entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
      <entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   </para>
  </sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+  <title><structname>pg_shmem_allocations_numa</structname></title>
+
+  <indexterm zone="view-pg-shmem-allocations-numa">
+   <primary>pg_shmem_allocations_numa</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_shmem_allocations_numa</structname> shows how shared
+   memory allocations in the server's main shared memory segment are distributed
+   across NUMA nodes. This includes both memory allocated by
+   <productname>PostgreSQL</productname> itself and memory allocated
+   by extensions using the mechanisms detailed in
+   <xref linkend="xfunc-shared-addin" />.
+  </para>
I think it'd be good to describe that the view will include multiple rows for
each name if spread across multiple numa nodes.
Perhaps also that querying this view is expensive and that
pg_shmem_allocations should be used if numa information isn't needed?
+    /*
+     * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+     * the OS may have different memory page sizes.
+     *
+     * To correctly map between them, we need to: 1. Determine the OS memory
+     * page size 2. Calculate how many OS pages are used by all buffer blocks
+     * 3. Calculate how many OS pages are contained within each database
+     * block.
+     *
+     * This information is needed before calling move_pages() for NUMA memory
+     * node inquiry.
+     */
+    os_page_size = pg_numa_get_pagesize();
+
+    /*
+     * Allocate memory for page pointers and status based on total shared
+     * memory size. This simplified approach allocates enough space for all
+     * pages in shared memory rather than calculating the exact requirements
+     * for each segment.
+     *
+     * XXX Isn't this wasteful? But there probably is one large segment of
+     * shared memory, much larger than the rest anyway.
+     */
+    shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+    page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+    pages_status = palloc(sizeof(int) * shm_total_page_count);
There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
could more be moved to a shared helper function?
+    hash_seq_init(&hstat, ShmemIndex);
+
+    /* output all allocated entries */
+    memset(nulls, 0, sizeof(nulls));
+    while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+    {
One thing that's interesting with using ShmemIndex is that this won't get
anonymous allocations. I think that's ok for now, it'd be complicated to
figure out the unmapped "regions". But I guess it'd be good to mention it in
the docs?
+        int         i;
+
+        /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
+         * It's a bit misleading to call that "aligned", no? */
+
+        /* Get number of OS aligned pages */
+        shm_ent_page_count
+            = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+        /*
+         * If we get ever 0xff back from kernel inquiry, then we probably have
+         * bug in our buffers to OS page mapping code here.
+         */
+        memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
There's obviously no guarantee that shm_ent_page_count is a multiple of
os_page_size. I think it'd be interesting to show in the view when one shmem
allocation shares a page with the prior allocation - that can contribute a bit
to contention. What about showing a start_os_page_id and end_os_page_id or
something? That could be a feature for later though.
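One way such columns could be derived is sketched below. This is only an illustration under the assumption that the segment base is aligned to the OS page size; OsPageRange, alloc_os_page_range() and shares_page_with_prev() are hypothetical names that do not exist in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pair of view columns for one shmem allocation. */
typedef struct OsPageRange
{
    size_t      start_os_page_id;   /* first OS page the allocation touches */
    size_t      end_os_page_id;     /* last OS page the allocation touches */
} OsPageRange;

/*
 * Compute the inclusive range of OS pages covered by an allocation that
 * starts at "location" and is "size" bytes long. Page ids are counted from
 * the page-aligned segment base.
 */
static OsPageRange
alloc_os_page_range(uintptr_t seg_base, uintptr_t location, size_t size,
                    size_t page_size)
{
    OsPageRange r;

    assert(size > 0 && page_size > 0);
    assert(seg_base % page_size == 0 && location >= seg_base);

    r.start_os_page_id = (location - seg_base) / page_size;
    r.end_os_page_id = (location + size - 1 - seg_base) / page_size;
    return r;
}

/* True if "cur" begins on the page on which the previous allocation ends. */
static int
shares_page_with_prev(OsPageRange prev, OsPageRange cur)
{
    return cur.start_os_page_id == prev.end_os_page_id;
}
```

For example, with 4kB pages a 6000-byte allocation at offset 0 covers pages 0..1, and a following allocation at offset 6016 starts on page 1 as well, so the two share a page.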
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
Could it be worthwhile to run the test if !pg_numa_available(), to test that
we do the right thing in that case? We need an alternative output anyway, so
that might be fine?
Greetings,
Andres Freund
Hi,
On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
I think one thing that the docs should mention is that calling the numa
functions/views will force the pages to be allocated, even if they're
currently unused.
Newly started server, with s_b of 32GB and 2MB huge pages:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 34448
HugePages_Rsvd: 16437
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
run
SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
Now the pages that previously were marked as reserved are actually allocated:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 18012
HugePages_Rsvd: 1
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
I don't see how we can avoid that right now, but at the very least we ought to
document it.
The only allocation where that really matters is shared_buffers. I wonder if
we could special case the logic for that, by only probing if at least one of
the buffers in the range is valid.
Then we could treat a page status of -ENOENT as "page is not mapped" and
display NULL for the node_id?
Of course that would mean that we'd always need to
pg_numa_touch_mem_if_required(), not just the first time round, because we
previously might not have for a page that is now valid. But compared to the
cost of actually allocating pages, the cost for that seems small.
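The status handling for that could look roughly like the sketch below. move_pages(2) reports either a node number or a negative errno per page, with -ENOENT meaning the page is not present. NodeIdResult and status_to_node_id() are made-up names for illustration, and the assert stands in for whatever error reporting the real code would use.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical result of translating one move_pages(2) status entry. */
typedef struct NodeIdResult
{
    bool        isnull;     /* true: report NULL for node_id */
    int         node_id;    /* valid only when !isnull */
} NodeIdResult;

/*
 * Map a per-page status to a node_id column value: -ENOENT ("page is not
 * present") becomes NULL, a non-negative status is the NUMA node.
 */
static NodeIdResult
status_to_node_id(int status)
{
    NodeIdResult res = {false, -1};

    if (status == -ENOENT)
        res.isnull = true;
    else
    {
        /* any other negative status is an unexpected error in this sketch */
        assert(status >= 0);
        res.node_id = status;
    }
    return res;
}
```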
Greetings,
Andres Freund
On 4/6/25 00:29, Andres Freund wrote:
Hi,
I just played around with this for a bit. As noted somewhere further down,
pg_buffercache_numa.page_num ends up wonky in different ways for the different
pages.
I think one thing that the docs should mention is that calling the numa
functions/views will force the pages to be allocated, even if they're
currently unused.
Newly started server, with s_b of 32GB and 2MB huge pages:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 34448
HugePages_Rsvd: 16437
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
run
SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
Now the pages that previously were marked as reserved are actually allocated:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 18012
HugePages_Rsvd: 1
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
I don't see how we can avoid that right now, but at the very least we ought to
document it.
+1 to documenting this
On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote:
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
Hm? libnuma1:i386 installs just fine to me on debian and contains the shared
library.
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+  # via pkg-config
+  libnuma = dependency('numa', required: libnumaopt)
+  if not libnuma.found()
+    libnuma = cc.find_library('numa', required: libnumaopt)
+  endif
This fallback isn't going to work if -Dlibnuma=enabled is used, as
dependency() will error out, due to not finding a required dependency. You'd
need to use required: false there.
Do we actually need a non-dependency() fallback here? It's linux only and a
new dependency, so just requiring a .pc file seems like it'd be fine?
No idea.
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+    ro_volatile_var = *(uint64 *) ptr
Does it really work that way? A volatile variable to assign the result of
dereferencing ptr ensures that the *store* isn't removed by the compiler, but
it doesn't at all guarantee that the *load* isn't removed, since that memory
isn't marked as volatile.
I think you'd need to cast the source pointer to a volatile uint64* to ensure
the load happens.
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+    SELECT P.* FROM pg_buffercache_numa_pages() AS P
+    (bufferid integer, page_num int4, node_id int4);
Why CREATE OR REPLACE?
I think this is simply due to copy-pasting the code, a plain CREATE
would be enough here.
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
...
+
+        /*
+         * To smoothly support upgrades from version 1.0 of this extension
+         * transparently handle the (non-)existence of the pinning_backends
+         * column. We unfortunately have to get the result type for that... -
+         * we can't use the result type determined by the function definition
+         * without potentially crashing when somebody uses the old (or even
+         * wrong) function definition though.
+         */
+        if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+            elog(ERROR, "return type must be a row type");
+
Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view?
Yes, good catch.
+
+        if (firstNumaTouch)
+            elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
Over the patchseries the related code is duplicated. Seems like it'd be good
to put it into pg_numa instead? This seems like the thing that's good to
abstract away in one central spot.
Abstract away which part, exactly? I thought about moving some of the
code to port/pg_numa.c, but it didn't seem worth it.
+        /*
+         * Scan through all the buffers, saving the relevant fields in the
+         * fctx->record structure.
+         *
+         * We don't hold the partition locks, so we don't get a consistent
+         * snapshot across all buffers, but we do grab the buffer header
+         * locks, so the information of each buffer is self-consistent.
+         *
+         * This loop touches and stores addresses into os_page_ptrs[] as input
+         * to one big big move_pages(2) inquiry system call. Basically we ask
+         * for all memory pages for NBuffers.
+         */
+        idx = 0;
+        for (i = 0; i < NBuffers; i++)
+        {
+            BufferDesc *bufHdr;
+            uint32      buf_state;
+            uint32      bufferid;
+
+            CHECK_FOR_INTERRUPTS();
+
+            bufHdr = GetBufferDescriptor(i);
+
+            /* Lock each buffer header before inspecting. */
+            buf_state = LockBufHdr(bufHdr);
+            bufferid = BufferDescriptorGetBuffer(bufHdr);
+
+            UnlockBufHdr(bufHdr, buf_state);
Given that the only thing you're getting here is the buffer id, it's probably
not worth getting the spinlock, a single 4 byte read is always atomic on our
platforms.
Fine with me. I don't expect this to make a measurable difference so I
kept the spinlock, but if we want to remove it, I won't object.
+            /*
+             * If we have multiple OS pages per buffer, fill those in too. We
+             * always want at least one OS page, even if there are multiple
+             * buffers per page.
+             *
+             * Altough we could query just once per each OS page, we do it
+             * repeatably for each Buffer and hit the same address as
+             * move_pages(2) requires page aligment. This also simplifies
+             * retrieval code later on. Also NBuffers starts from 1.
+             */
+            for (j = 0; j < Max(1, pages_per_buffer); j++)
+            {
+                char       *buffptr = (char *) BufferGetBlock(i + 1);
+
+                fctx->record[idx].bufferid = bufferid;
+                fctx->record[idx].numa_page = j;
+
+                os_page_ptrs[idx]
+                    = (char *) TYPEALIGN(os_page_size,
+                                         buffptr + (os_page_size * j));
FWIW, this bit here is the most expensive part of the function itself, as the
compiler has no choice than to implement it as an actual division, as
os_page_size is runtime variable.
It'd be fine to leave it like that, the call to numa_move_pages() is way more
expensive. But it shouldn't be too hard to do that alignment once, rather than
having to do it over and over.
Division? It's entirely possible I'm missing something obvious, but I
don't see any divisions in this code. You're however right we could get
rid of most of this, because we could get the buffer pointer once (it's
a bit silly we get it for each page), align that, and then simply add
the page size. Something like this:
/* align to start of OS page, determine pointer to end of buffer */
char *buffptr = (char *) BufferGetBlock(i + 1);
/* pointers can't be used with %, so go through uintptr_t */
char *ptr = buffptr - ((uintptr_t) buffptr % os_page_size);
char *endptr = buffptr + BLCKSZ;
while (ptr < endptr)
{
    os_page_ptrs[idx] = ptr;
    ...
    ptr += os_page_size;
}
This also made me think a bit more about how the blocks and pages might
align / overlap. AFAIK the buffers are aligned to PG_IO_ALIGN_SIZE,
which on x86 is 4K, i.e. the same as OS page size. But let's say the OS
page size is larger, say 1MB. AFAIK it could then happen that a buffer spans
multiple larger memory pages.
For example, 8K buffer could start 4K before the 1MB page boundary, and
use 4K from the next memory page. This would mean the current formulas
for buffer_per_page and pages_per_buffer can be off by 1.
This would complicate calculating os_page_count a bit, because only some
buffers would actually need the +1 (in the array / view output).
Or what am I missing? Is there something that guarantees this won't happen?
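The number of OS pages a single buffer touches can be computed from its address rather than from a fixed pages_per_buffer ratio. The following is a sketch of that calculation only; os_pages_spanned() is a hypothetical helper, not something in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of OS pages touched by a buffer of "blcksz" bytes starting at
 * "buf_addr". Works whether the page is smaller or larger than the buffer.
 */
static size_t
os_pages_spanned(uintptr_t buf_addr, size_t blcksz, size_t page_size)
{
    uintptr_t   mask = ~((uintptr_t) page_size - 1);
    uintptr_t   first = buf_addr & mask;                   /* page of first byte */
    uintptr_t   last = (buf_addr + blcksz - 1) & mask;     /* page of last byte */

    assert(blcksz > 0);
    /* masking only works for power-of-two page sizes */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    return (last - first) / page_size + 1;
}
```

In the example above, an 8K buffer that starts 4K before a 1MB page boundary yields 2, whereas the same buffer starting exactly on the boundary yields 1.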
FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
works quite right for me.
This definition afaict is always 0 when using huge pages and just 0 and 1 for
4k pages. But my understanding of numa_page is that it's the "id" of the numa
pages, which isn't that?
With "adjust page_num" I get a number that starts at -1 and then increments
from there. More correct, but doesn't quite seem right either.
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>bufferid</structfield> <type>integer</type>
+      </para>
+      <para>
+       ID, in the range 1..<varname>shared_buffers</varname>
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>page_num</structfield> <type>int</type>
+      </para>
+      <para>
+       number of OS memory page for this buffer
+      </para></entry>
+     </row>
"page_num" doesn't really seem very descriptive for "number of OS memory page
for this buffer". To me "number of" sounds like it's counting the number of
associated pages, but it's really just a "page id" or something like that.
Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
Yeah, I haven't updated the docs in 0003 when adjusting the page_num
definition. It was more an experiment to see if others like this change.
From 03d24af540f8235ad9ca9537db0a1ba5dbcf6ccb Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v25 4/5] Introduce pg_shmem_allocations_numa view
Introduce new pg_shmem_alloctions_numa view with information about how
shared memory is distributed across NUMA nodes.
Nice.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: /messages/by-id/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N=q1w+DiH-696Xw@mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 152 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 ++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 294 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..a83365ae24a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
 <entry>shared memory allocations</entry>
 </row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
 <row>
 <entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
 <entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
 </para>
 </sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
I think it'd be good to describe that the view will include multiple rows for
each name if spread across multiple numa nodes.
Perhaps also that querying this view is expensive and that
pg_shmem_allocations should be used if numa information isn't needed?
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * XXX Isn't this wasteful? But there probably is one large segment of
+ * shared memory, much larger than the rest anyway.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
could more be moved to a shared helper function?
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
One thing that's interesting with using ShmemIndex is that this won't get
anonymous allocations. I think that's ok for now, it'd be complicated to
figure out the unmapped "regions". But I guess it'd be good to mention it in
the docs?
Agreed.
+ int i;
+
+ /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
+ * It's a bit misleading to call that "aligned", no? */
+
+ /* Get number of OS aligned pages */
+ shm_ent_page_count
+ = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
There's obviously no guarantee that shm_ent_page_count is a multiple of
os_page_size. I think it'd be interesting to show in the view when one shmem
allocation shares a page with the prior allocation - that can contribute a bit
to contention. What about showing a start_os_page_id and end_os_page_id or
something? That could be a feature for later though.
Yeah, adding first/last page might be interesting.
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
Could it be worthwhile to run the test if !pg_numa_available(), to test that
we do the right thing in that case? We need an alternative output anyway, so
that might be fine?
+1
regards
--
Tomas Vondra
On 4/6/25 01:00, Andres Freund wrote:
Hi,
On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
I think one thing that the docs should mention is that calling the numa
functions/views will force the pages to be allocated, even if they're
currently unused.
Newly started server, with s_b of 32GB and 2MB huge pages:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 34448
HugePages_Rsvd: 16437
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
run
SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
Now the pages that previously were marked as reserved are actually allocated:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 18012
HugePages_Rsvd: 1
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
I don't see how we can avoid that right now, but at the very least we ought to
document it.
The only allocation where that really matters is shared_buffers. I wonder if
we could special case the logic for that, by only probing if at least one of
the buffers in the range is valid.
Then we could treat a page status of -ENOENT as "page is not mapped" and
display NULL for the node_id?
Of course that would mean that we'd always need to
pg_numa_touch_mem_if_required(), not just the first time round, because we
previously might not have for a page that is now valid. But compared to the
cost of actually allocating pages, the cost for that seems small.
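For reference, the status handling suggested above could be sketched roughly as below. This is only an illustration and not code from the patch: the NodeLookup struct and interpret_page_status() are hypothetical names. The one thing it relies on is the move_pages(2) convention that, when called with nodes == NULL, each status entry is either a node id (>= 0) or a negative errno value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type, not part of the patch. */
typedef struct NodeLookup
{
	bool		is_null;	/* true when the page is not mapped */
	int			node_id;	/* meaningful only when is_null is false */
} NodeLookup;

/*
 * Interpret one entry of the status array filled in by move_pages(2).
 * Returns 0 when the entry could be interpreted, or the negative errno
 * value itself for anything other than -ENOENT.
 */
static int
interpret_page_status(int status, NodeLookup *out)
{
	assert(out != NULL);

	if (status >= 0)
	{
		/* the page is resident on this NUMA node */
		out->is_null = false;
		out->node_id = status;
		return 0;
	}

	if (status == -ENOENT)
	{
		/* page not present: report SQL NULL instead of faulting it in */
		out->is_null = true;
		out->node_id = -1;
		return 0;
	}

	/* e.g. -EFAULT or -EACCES: leave the decision to the caller */
	return status;
}
```

A caller would set the nulls[] entry of the output tuple from is_null; whether that behaviour is what users want is a separate question.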
I don't think this would be a good trade off. The buffers already have a
NUMA node, and users would be interested in that. It's just that we
don't have the memory mapped in the current backend, so I'd bet people
would not be happy with NULL, and would proceed to force the allocation
in some other way (say, a large query of some sort). Which obviously
causes a lot of other problems.
I can imagine having a flag that makes the allocation optional, but
there's no convenient way to pass that to a view, and I think most
people want the allocation anyway.
Especially for monitoring purposes, which usually happens in a new
connection, so the backend has little opportunity to allocate the pages
"naturally."
regards
--
Tomas Vondra
On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
Hi Andres/Tomas,
I noticed that Tomas responded to this while I was writing my reply, so I'm
attaching git-am patches based on his v25 (not squashed). There is only
one new patch (the last one, containing fixes based on this review) plus a
slight amendment to the 0001 commit.
I just played around with this for a bit. As noted somewhere further down,
pg_buffercache_numa.page_num ends up wonky in different ways for the different
pages.
I think page_num is still very much a work in progress... I'm still not sure
whether it is worth exposing (is it worth the hassle?). If we drop it,
the view won't be perfect, but we'd still have everything else; otherwise we
put the whole feature at risk, as feature freeze is literally tomorrow.
I think one thing that the docs should mention is that calling the numa
functions/views will force the pages to be allocated, even if they're
currently unused.
[..]
I don't see how we can avoid that right now, but at the very least we ought to
document it.
Added statement about this.
On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote:
The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.
Hm? libnuma1:i386 installs just fine to me on debian and contains the shared
library.
OK, removed from the commit message; yes, Google confirms it really
exists (somehow I couldn't find it back then, at least here)...
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
This fallback isn't going to work if -Dlibnuma=enabled is used, as
dependency() will error out, due to not finding a required dependency. You'd
need to use required: false there.
Do we actually need a non-dependency() fallback here? It's linux only and a
new dependency, so just requiring a .pc file seems like it'd be fine?
I'm not sure pkg-config is present everywhere, but I'm no expert, and
AFAIR we are not consistent in how the various libs are handled there. It's
quite late, so for now I've just followed your recommendation to call
dependency() with required: false.
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) ptr
Does it really work that way? A volatile variable to assign the result of
dereferencing ptr ensures that the *store* isn't removed by the compiler, but
it doesn't at all guarantee that the *load* isn't removed, since that memory
isn't marked as volatile.
I think you'd need to cast the source pointer to a volatile uint64* to ensure
the load happens.
OK, thanks, good catch; I was not aware the compiler could bypass it like that.
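To illustrate the point about the load, here is a minimal sketch (not the patch's macro). Casting the source pointer to a volatile-qualified type makes the read itself a volatile access, so the compiler has to perform it even though the value is discarded. touch_page() and touch_range() are made-up names for this example, and it reads a single byte rather than a uint64 so that alignment of the pointer does not matter.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Force a read of the byte at ptr.  The volatile qualifier is on the
 * pointed-to type, so it is the load that may not be optimized away.
 */
static inline void
touch_page(const void *ptr)
{
	(void) *(const volatile unsigned char *) ptr;
}

/*
 * Touch one byte every page_size bytes in [start, start + len), which
 * is enough to fault in each page the range starts a stride on.
 * Returns the number of reads performed.
 */
static size_t
touch_range(const void *start, size_t len, size_t page_size)
{
	const char *p = (const char *) start;
	size_t		touched = 0;

	assert(page_size > 0);

	for (size_t off = 0; off < len; off += page_size)
	{
		touch_page(p + off);
		touched++;
	}

	return touched;
}
```

Reading instead of writing keeps the contents untouched, which matters for shared memory that other backends are using at the same time.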
[..]
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
Why CREATE OR REPLACE?
Fixed.
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
...
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... -
+ * we can't use the result type determined by the function definition
+ * without potentially crashing when somebody uses the old (or even
+ * wrong) function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view?
Removed.
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
Over the patchseries the related code is duplicated. Seems like it'd be good
to put it into pg_numa instead? This seems like the thing that's good to
abstract away in one central spot.
I hear you, but we are using those statics locally, per source file
(shm.c's firstTouch != pgbuffercache.c's firstTouch).
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+
+ UnlockBufHdr(bufHdr, buf_state);
Given that the only thing you're getting here is the buffer id, it's probably
not worth getting the spinlock, a single 4 byte read is always atomic on our
platforms.
Well, I think this is just copy&pasted from original function, so we
follow the pattern for consistency.
+ /*
+ * If we have multiple OS pages per buffer, fill those in too. We
+ * always want at least one OS page, even if there are multiple
+ * buffers per page.
+ *
+ * Altough we could query just once per each OS page, we do it
+ * repeatably for each Buffer and hit the same address as
+ * move_pages(2) requires page aligment. This also simplifies
+ * retrieval code later on. Also NBuffers starts from 1.
+ */
+ for (j = 0; j < Max(1, pages_per_buffer); j++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].numa_page = j;
+
+ os_page_ptrs[idx]
+ = (char *) TYPEALIGN(os_page_size,
+ buffptr + (os_page_size * j));
FWIW, this bit here is the most expensive part of the function itself, as the
compiler has no choice than to implement it as an actual division, as
os_page_size is runtime variable.
It'd be fine to leave it like that, the call to numa_move_pages() is way more
expensive. But it shouldn't be too hard to do that alignment once, rather than
having to do it over and over.
TBH, I don't think we should spend a lot of time optimizing; after all
it's a debugging view (at the start I was actually considering making
it a developer-only compile-time option, but with the shm view it is
actually usable for others too, and well... we want to have it as a
foundation for real NUMA optimizations).
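Just to show what "align once" could look like, here is a rough standalone sketch. It is not what the patch does (the patch queries per buffer and keeps one entry per buffer even when several buffers share a page), and page_align_down() and collect_page_addrs() are hypothetical names. It assumes the page size is a power of two and that the range is non-empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of page_size (must be a power of two). */
static uintptr_t
page_align_down(uintptr_t addr, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return addr & ~((uintptr_t) page_size - 1);
}

/*
 * Store the address of every OS page overlapping [start, start + len)
 * into ptrs[].  The alignment is computed once for the first page and
 * the loop then only adds page_size.  Returns the number of entries
 * written, never more than max_ptrs.
 */
static size_t
collect_page_addrs(uintptr_t start, size_t len, size_t page_size,
				   uintptr_t *ptrs, size_t max_ptrs)
{
	uintptr_t	page = page_align_down(start, page_size);
	uintptr_t	end = start + len;
	size_t		n = 0;

	assert(len > 0);

	for (; page < end && n < max_ptrs; page += page_size)
		ptrs[n++] = page;

	return n;
}
```

The resulting array has the shape move_pages(2) expects for its pages argument (page-aligned addresses), though mapping the results back to buffers would need extra bookkeeping that the per-buffer approach avoids.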
FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
works quite right for me.
This definition afaict is always 0 when using huge pages and just 0 and 1 for
4k pages. But my understanding of numa_page is that it's the "id" of the numa
pages, which isn't that?
With "adjust page_num" I get a number that starts at -1 and then increments
from there. More correct, but doesn't quite seem right either.
Apparently handling the edge cases of split buffers was a Pandora's
box ;) Tomas, what do we do about it? Does it have a chance to
get in before the freeze?
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>page_num</structfield> <type>int</type>
+ </para>
+ <para>
+ number of OS memory page for this buffer
+ </para></entry>
+ </row>
"page_num" doesn't really seem very descriptive for "number of OS memory page
for this buffer". To me "number of" sounds like it's counting the number of
associated pages, but it's really just a "page id" or something like that.
Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
Tomas, are you good with the rename? I think I would also prefer os_page_id.
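One possible reading of an "os_page_id" is the index of the OS page counted from the start of the buffer pool. The sketch below shows that arithmetic only; it is not the patch's definition. OsPageRange and buffer_os_pages() are hypothetical names, and it assumes the buffer pool begins on an OS page boundary, which the patch does not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: which OS pages a buffer occupies. */
typedef struct OsPageRange
{
	size_t		first_page;	/* id of the first OS page of the buffer */
	size_t		npages;		/* number of OS pages the buffer overlaps */
} OsPageRange;

/*
 * Compute the OS pages covered by the buffer with 0-based index
 * buf_index, counting pages from the start of the buffer pool.
 */
static OsPageRange
buffer_os_pages(size_t buf_index, size_t blcksz, size_t os_page_size)
{
	OsPageRange r;
	size_t		start = buf_index * blcksz;		/* first byte of buffer */
	size_t		last = start + blcksz - 1;		/* last byte of buffer */

	assert(blcksz > 0 && os_page_size > 0);

	r.first_page = start / os_page_size;
	r.npages = last / os_page_size - r.first_page + 1;
	return r;
}
```

With 8kB blocks and 4kB pages, buffer 3 covers pages 6 and 7; with 2MB huge pages the first 256 buffers all map to page 0, so the id stays meaningful in both cases.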
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
[..]
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
I think it'd be good to describe that the view will include multiple rows for
each name if spread across multiple numa nodes.
Added.
Perhaps also that querying this view is expensive and that
pg_shmem_allocations should be used if numa information isn't needed?
Already covered by 1st finding fix.
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * XXX Isn't this wasteful? But there probably is one large segment of
+ * shared memory, much larger than the rest anyway.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
could more be moved to a shared helper function?
-> Tomas?
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
One thing that's interesting with using ShmemIndex is that this won't get
anonymous allocations. I think that's ok for now, it'd be complicated to
figure out the unmapped "regions". But I guess it'd be good to mention it in
the docs?
Added.
+ int i;
+
+ /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
+ * It's a bit misleading to call that "aligned", no? */
+
+ /* Get number of OS aligned pages */
+ shm_ent_page_count
+ = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
There's obviously no guarantee that shm_ent_page_count is a multiple of
os_page_size. I think it'd be interesting to show in the view when one shmem
allocation shares a page with the prior allocation - that can contribute a bit
to contention. What about showing a start_os_page_id and end_os_page_id or
something? That could be a feature for later though.
I was thinking about it, but it could be done when analyzing this
together with data from pg_shmem_allocations(?) My worry is timing :(
Anyway, we could extend this view in future revisions.
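If we do extend the view later, the start/end page idea could be computed along these lines. This is only a sketch under assumptions: AllocPageSpan, alloc_page_span() and shares_page_with_prior() are hypothetical names, offsets are taken relative to a page-aligned start of the shared memory segment, and allocations are assumed to be non-empty and visited in address order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: first and last OS page touched by one shmem allocation. */
typedef struct AllocPageSpan
{
	size_t		start_os_page_id;
	size_t		end_os_page_id;
} AllocPageSpan;

/* Compute the page span of an allocation at [offset, offset + size). */
static AllocPageSpan
alloc_page_span(size_t offset, size_t size, size_t os_page_size)
{
	AllocPageSpan span;

	assert(size > 0 && os_page_size > 0);

	span.start_os_page_id = offset / os_page_size;
	span.end_os_page_id = (offset + size - 1) / os_page_size;
	return span;
}

/*
 * True when an allocation begins on the same OS page on which the
 * previous (lower-addressed) allocation ends.
 */
static bool
shares_page_with_prior(AllocPageSpan prev, AllocPageSpan cur)
{
	return cur.start_os_page_id == prev.end_os_page_id;
}
```

Note that the page count derived from such a span (end - start + 1) can be one higher than rounding the allocation size up to whole pages, which is exactly the case where the allocation does not start on a page boundary.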
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
Could it be worthwhile to run the test if !pg_numa_available(), to test that
we do the right thing in that case? We need an alternative output anyway, so
that might be fine?
Added. The meson test passes, but I'm sending this as fast as possible
to avoid a clash with Tomas.
-J.
Attachments:
v25-0006-fixes-for-review-by-Andres.patch (application/octet-stream)
From 7b279721ae04e823f20e94331dd3b0a634ff3e7f Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Sun, 6 Apr 2025 14:19:41 +0200
Subject: [PATCH v25 6/6] fixes for review by Andres
---
contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 8 --------
doc/src/sgml/system-views.sgml | 8 +++++++-
meson.build | 2 +-
src/include/port/pg_numa.h | 2 +-
src/test/regress/expected/numa.out | 1 +
src/test/regress/expected/numa_1.out | 2 ++
src/test/regress/sql/numa.sql | 1 +
8 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
index 1230e244a5f..e3b145a1687 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -10,7 +10,7 @@ AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
LANGUAGE C PARALLEL SAFE;
-- Create a view for convenient access.
-CREATE OR REPLACE VIEW pg_buffercache_numa AS
+CREATE VIEW pg_buffercache_numa AS
SELECT P.* FROM pg_buffercache_numa_pages() AS P
(bufferid integer, page_num int4, node_id int4);
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index a3c4a2578d9..df94cc6ef7d 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -375,14 +375,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/* Create a user function context for cross-call persistence */
fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
- /*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
- */
if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index a83365ae24a..4e853885de6 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -4069,7 +4069,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
across NUMA nodes. This includes both memory allocated by
<productname>PostgreSQL</productname> itself and memory allocated
by extensions using the mechanisms detailed in
- <xref linkend="xfunc-shared-addin" />.
+ <xref linkend="xfunc-shared-addin" />. This view will output multiple rows
+ for each of the shared memory segments provided that they are spread accross
+ multiple NUMA nodes. This view should not be queried by monitoring systems
+ as it is very slow and may end up allocating shared memory in case it was not
+ used earlier.
+ Current limitation for this view is that won't show anonymous shared memory
+ allocations.
</para>
<para>
diff --git a/meson.build b/meson.build
index b562a00c588..a1516e54529 100644
--- a/meson.build
+++ b/meson.build
@@ -950,7 +950,7 @@ endif
libnumaopt = get_option('libnuma')
if not libnumaopt.disabled()
# via pkg-config
- libnuma = dependency('numa', required: libnumaopt)
+ libnuma = dependency('numa', required: false)
if not libnuma.found()
libnuma = cc.find_library('numa', required: libnumaopt)
endif
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 3c1b50c1428..7e990d9f776 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -28,7 +28,7 @@ extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
* need to page-fault before move_pages(2) syscall returns valid results.
*/
#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
- ro_volatile_var = *(uint64 *) ptr
+ ro_volatile_var = *(volatile uint64 *) ptr
#else
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
index 668172f7d79..8af5dfeb9a5 100644
--- a/src/test/regress/expected/numa.out
+++ b/src/test/regress/expected/numa.out
@@ -1,5 +1,6 @@
SELECT NOT(pg_numa_available()) AS skip_test \gset
\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
\quit
\endif
-- switch to superuser
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
index 6dd6824b4e4..c90042fa7cc 100644
--- a/src/test/regress/expected/numa_1.out
+++ b/src/test/regress/expected/numa_1.out
@@ -1,3 +1,5 @@
SELECT NOT(pg_numa_available()) AS skip_test \gset
\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+ERROR: libnuma initialization failed or NUMA is not supported on this platform
\quit
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
index 034098783fb..324481c33b7 100644
--- a/src/test/regress/sql/numa.sql
+++ b/src/test/regress/sql/numa.sql
@@ -1,5 +1,6 @@
SELECT NOT(pg_numa_available()) AS skip_test \gset
\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
\quit
\endif
--
2.39.5
v25-0002-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch (application/octet-stream)
From 98435a1e46768784f22aa0929a83951ac0a5a965 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 20:21:25 +0200
Subject: [PATCH v25 2/6] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 5 +-
.../expected/pg_buffercache_numa.out | 28 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 22 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 288 ++++++++++++++++++
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 75 ++++-
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 443 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..1230e244a5f
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, page_num int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..0b96476c319 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -20,6 +21,8 @@
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
+#define NUM_BUFFERCACHE_NUMA_ELEM 3
+
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
.version = PG_VERSION
@@ -58,16 +61,44 @@ typedef struct
BufferCachePagesRec *record;
} BufferCachePagesContext;
+/*
+ * Record structure holding the to be exposed cache data.
+ */
+typedef struct
+{
+ uint32 bufferid;
+ int32 numa_page;
+ int32 numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+ TupleDesc tupdesc;
+ int buffers_per_page;
+ int pages_per_buffer;
+ int os_page_size;
+ BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
/*
* Function returning data from the shared buffer cache - buffer number,
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -246,6 +277,263 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * Inquire about NUMA memory mappings for shared buffers.
+ *
+ * Returns NUMA node ID for each memory page used by the buffer. Buffers may
+ * be smaller or larger than OS memory pages. For each buffer we return one
+ * entry for each memory page used by the buffer (it fhe buffer is smaller,
+ * it only uses a part of one memory page).
+ *
+ * We expect both sizes (for buffers and memory pages) to be a power-of-2, so
+ * one is always a multiple of the other.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ BufferCacheNumaContext *fctx; /* User function context. */
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ j,
+ idx;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_page_status;
+ uint64 os_page_count;
+ int pages_per_buffer;
+ int buffers_per_page;
+ volatile uint64 touch pg_attribute_unused();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ buffers_per_page = os_page_size / BLCKSZ;
+ pages_per_buffer = BLCKSZ / os_page_size;
+
+ /*
+ * The page size and the block size are both expected to be powers of 2,
+ * so one divides the other (we don't know in which direction).
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * Either both counts are 1 (when the pages have the same size), or
+ * exactly one of them is zero. Both can't be zero at the same time.
+ */
+ Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
+ Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
+ ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+
+ /*
+ * How many addresses we are going to query (store) depends on the
+ * relation between BLCKSZ : PAGESIZE. We need at least one status per
+ * buffer - if the memory page is larger than buffer, we still query
+ * it for each buffer. With multiple memory pages per buffer, we need
+ * that many entries.
+ */
+ os_page_count = NBuffers * Max(1, pages_per_buffer);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " "
+ "os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_count, os_page_size,
+ buffers_per_page, pages_per_buffer);
+
+
+ /* Initialize the multi-call context, load entries about buffers */
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+ /*
+ * Look up the result type expected by the caller and verify it. The
+ * pinning_backends compatibility handling in pg_buffercache_pages()
+ * does not apply here, as this function has had the same columns
+ * since it was introduced. We still can't use the result type
+ * determined by the function definition without potentially crashing
+ * when somebody uses an old (or even wrong) function definition.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "page_num",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "node_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate os_page_count worth of BufferCacheNumaRec records. */
+ fctx->record = (BufferCacheNumaRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCacheNumaRec) * os_page_count);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = os_page_count;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+
+ /* Used to determine the NUMA node for all OS pages at once */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffers to OS page mapping code here.
+ */
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ /*
+ * If we have multiple OS pages per buffer, fill those in too. We
+ * always want at least one OS page, even if there are multiple
+ * buffers per page.
+ *
+ * Although we could query just once per each OS page, we do it
+ * repeatedly for each Buffer and hit the same address, as
+ * move_pages(2) requires page alignment. This also simplifies the
+ * retrieval code later on. Also, buffer IDs start from 1.
+ */
+ for (j = 0; j < Max(1, pages_per_buffer); j++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].numa_page = j;
+
+ os_page_ptrs[idx]
+ = (char *) TYPEALIGN_DOWN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+
+ ++idx;
+ }
+
+ }
+
+ /* We should get exactly the expected number of entries */
+ Assert(idx == os_page_count);
+
+ /* Query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /*
+ * Update the entries with NUMA node ID. The status array is indexed
+ * the same way as the entry index.
+ */
+ for (i = 0; i < os_page_count; i++)
+ {
+ fctx->record[i].numa_node = os_page_status[i];
+ }
+
+ /* Remember this backend touched the pages */
+ firstNumaTouch = false;
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+ Datum values[NUM_BUFFERCACHE_NUMA_ELEM];
+ bool nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(fctx->record[i].numa_page);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(fctx->record[i].numa_node);
+ nulls[2] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..b01f8e71357 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,15 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+ information is not part of <function>pg_buffercache_pages()</function>
+ itself, as it is much slower to retrieve.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +211,68 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>page_num</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of the OS memory page for this buffer
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0c81d03950d..ed74a76a5c7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -341,6 +341,8 @@ BufFile
Buffer
BufferAccessStrategy
BufferAccessStrategyType
+BufferCacheNumaRec
+BufferCacheNumaContext
BufferCachePagesContext
BufferCachePagesRec
BufferDesc
--
2.39.5
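The block-size to page-size mapping that pg_buffercache_numa_pages() performs before its move_pages(2) inquiry can be sketched in isolation. This is only an illustration under stated assumptions: PageMapping, is_power_of_two() and compute_page_mapping() are hypothetical names that do not exist in the patch, and the block size, page size and buffer count are passed in as plain arguments instead of coming from BLCKSZ, pg_numa_get_pagesize() and NBuffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type, not part of the patch. */
typedef struct PageMapping
{
	int			buffers_per_page;	/* > 0 if one OS page holds whole buffers */
	int			pages_per_buffer;	/* > 0 if one buffer spans whole OS pages */
	uint64_t	query_count;		/* addresses to hand to the kernel */
} PageMapping;

static int
is_power_of_two(size_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Mirror the arithmetic from the patch: integer division makes exactly one
 * of the two ratios zero unless both sizes are equal, and every buffer gets
 * at least one query even when several buffers share one OS page.
 */
static PageMapping
compute_page_mapping(size_t block_size, size_t os_page_size, int nbuffers)
{
	PageMapping m;

	assert(is_power_of_two(block_size) && is_power_of_two(os_page_size));
	assert(nbuffers >= 0);

	m.buffers_per_page = (int) (os_page_size / block_size);
	m.pages_per_buffer = (int) (block_size / os_page_size);
	m.query_count = (uint64_t) nbuffers *
		(uint64_t) (m.pages_per_buffer > 0 ? m.pages_per_buffer : 1);

	return m;
}
```

With 16384 buffers of 8kB on 4kB pages this asks for 32768 addresses, while the same buffers on 2MB huge pages need only one query per buffer, i.e. 16384.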
Attachment: v25-0004-Introduce-pg_shmem_allocations_numa-view.patch (application/octet-stream)
From 1cabc553ea4a7e185fd83f1ab081521820fe6229 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v25 4/6] Introduce pg_shmem_allocations_numa view
Introduce a new pg_shmem_allocations_numa view with information about how
shared memory is distributed across NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 152 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 ++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 294 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..a83365ae24a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..5d979423bd9 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,152 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * XXX Isn't this wasteful? But there probably is one large segment of
+ * shared memory, much larger than the rest anyway.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
+ * It's a bit misleading to call that "aligned", no? */
+
+ /* Get number of OS aligned pages */
+ shm_ent_page_count
+ = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ /*
+ * Setup page_ptrs[] with pointers to all OS pages for this segment,
+ * and get the NUMA status using pg_numa_query_pages.
+ *
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (ENOENT, which indicates unmapped/unallocated pages).
+ */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ /* Count the pages on each NUMA node for this shared memory entry */
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we only use a valid index into the array */
+ if (s < 0 || s > max_nodes)
+ {
+ elog(ERROR, "invalid NUMA node id outside of allowed range "
+ "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ nodes[s]++;
+ }
+
+ /*
+ * Add one entry for each NUMA node, including those without allocated
+ * memory for this segment.
+ */
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ *
+ * XXX Not quite sure why this is at the end, and what "output memory"
+ * refers to.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index dfc59ea0cc8..a93075c675c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8546,6 +8546,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..668172f7d79
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 1fddb13b6ae..c25062c288f 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3219,8 +3219,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3242,6 +3242,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3261,6 +3267,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..abfdc97abc5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..034098783fb
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 85d7280f35f..f337aa67c13 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1947,8 +1947,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1958,12 +1958,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
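The per-node accounting in pg_get_shmem_allocations_numa() boils down to counting pages per status value and scaling by the page size. Below is a minimal standalone sketch of that step, assuming a status array already filled in by a move_pages(2)-style inquiry; sum_bytes_per_node() is a hypothetical name, and it reports an out-of-range status (for example -2 for an untouched page) through its return value where the patch raises an ERROR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: add up the bytes that live on
 * each NUMA node, given one status entry per OS page.
 *
 * bytes_per_node must have room for max_node + 1 entries. Returns 0 on
 * success and -1 if any status is outside [0, max_node].
 */
static int
sum_bytes_per_node(const int *status, size_t npages, size_t page_size,
				   int max_node, uint64_t *bytes_per_node)
{
	assert(status != NULL && bytes_per_node != NULL);
	assert(max_node >= 0);

	memset(bytes_per_node, 0, sizeof(uint64_t) * ((size_t) max_node + 1));

	for (size_t i = 0; i < npages; i++)
	{
		int			s = status[i];

		/* negative values are kernel error codes, not node IDs */
		if (s < 0 || s > max_node)
			return -1;

		bytes_per_node[s] += page_size;
	}

	return 0;
}
```

Emitting one row per node, including nodes with zero bytes, then only requires walking bytes_per_node[0..max_node], which is what the final loop in the patch does.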
Attachment: v25-0005-adjust-page-alignment.patch (application/octet-stream)
From bc2a6e9d279c38afa1ccfd60ec93ad88eaf80b36 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 5 Apr 2025 16:20:13 +0200
Subject: [PATCH v25 5/6] adjust page alignment
---
src/backend/storage/ipc/shmem.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 5d979423bd9..4a9a9606f2e 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -637,13 +637,22 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
{
int i;
+ char *startptr,
+ *endptr;
+ Size total_len;
- /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
- * It's a bit misleading to call that "aligned", no? */
+ /*
+ * Calculate the range of OS pages used by this segment. The segment
+ * may start / end half-way through a page, we want to count these
+ * pages too. So we align the start/end pointers down/up, and then
+ * calculate the number of pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, ent->location);
+ endptr = (char *) TYPEALIGN(os_page_size,
+ (char *) ent->location + ent->allocated_size);
+ total_len = (endptr - startptr);
- /* Get number of OS aligned pages */
- shm_ent_page_count
- = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+ shm_ent_page_count = total_len / os_page_size;
/*
* If we get ever 0xff back from kernel inquiry, then we probably have
@@ -663,7 +672,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
{
volatile uint64 touch pg_attribute_unused();
- page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ page_ptrs[i] = startptr + (i * os_page_size);
if (firstNumaTouch)
pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
--
2.39.5
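The effect of aligning the start pointer down and the end pointer up, as the 0005 patch does, is easiest to see with plain integers. The sketch below is an assumption-laden illustration: count_spanned_pages() is a hypothetical name, addresses are modelled as uintptr_t values, and the page size is assumed to be a power of two so that masking can stand in for TYPEALIGN_DOWN and TYPEALIGN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: number of OS pages touched by
 * the byte range [start, start + len), counting partially covered pages at
 * both ends.
 */
static size_t
count_spanned_pages(uintptr_t start, size_t len, size_t page_size)
{
	uintptr_t	mask;
	uintptr_t	first;
	uintptr_t	end;

	/* masking only works for power-of-two page sizes */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	mask = (uintptr_t) page_size - 1;
	first = start & ~mask;					/* round start down to a page */
	end = (start + len + mask) & ~mask;		/* round end up to a page */

	return (size_t) ((end - first) / page_size);
}
```

A 200-byte allocation starting at offset 4000 with 4kB pages straddles a page boundary, so it spans two pages, whereas rounding only the length up to a page multiple would report one.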
Attachment: v25-0001-Add-support-for-basic-NUMA-awareness.patch (application/octet-stream)
From 40fdd84b2e03c1121c87ee64403d14b75e2a7379 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v25 1/6] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by the
--with-libnuma configure option. For now this is Linux-only; other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with the NUMA library.
On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 442 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 8f4a5ab28ec..0936010718d 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index fc5f7475d07..2a78cddd825 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0224f93733d..9ab070adffb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..8ebf0b03ec0 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 27717ad8976..b562a00c588 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3279,6 +3300,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3935,6 +3957,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5d5be8ba4e1..dfc59ea0cc8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8542,6 +8542,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9891b9b05c3..1af0b6316dd 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -689,6 +689,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..3c1b50c1428
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(uint64 *) ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 48d2dfb7cf3..fc7b059fee5 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use move_pages(2) syscall here - instead of get_mempolicy(2) - as the
+ * first one allows us to batch and query about many memory pages in one single
+ * giant system call that is way faster.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.39.5
Attachment: v25-0003-adjust-page_num.patch
From c3c030c3fb3a2c39a164a7ab1b7bea9df5f5a9b7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 5 Apr 2025 16:00:39 +0200
Subject: [PATCH v25 3/6] adjust page_num
---
contrib/pg_buffercache/pg_buffercache_pages.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 0b96476c319..a3c4a2578d9 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -315,6 +315,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
int pages_per_buffer;
int buffers_per_page;
volatile uint64 touch pg_attribute_unused();
+ char *startptr = NULL;
if (pg_numa_init() == -1)
elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
@@ -437,6 +438,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
* to one big big move_pages(2) inquiry system call. Basically we ask
* for all memory pages for NBuffers.
*/
+ startptr = (char *) BufferGetBlock(1);
idx = 0;
for (i = 0; i < NBuffers; i++)
{
@@ -469,11 +471,14 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
char *buffptr = (char *) BufferGetBlock(i + 1);
fctx->record[idx].bufferid = bufferid;
- fctx->record[idx].numa_page = j;
os_page_ptrs[idx]
- = (char *) TYPEALIGN(os_page_size,
- buffptr + (os_page_size * j));
+ = (char *) TYPEALIGN_DOWN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* calculate ID of the OS memory page */
+ fctx->record[idx].numa_page
+ = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;
/* Only need to touch memory once per backend process lifetime */
if (firstNumaTouch)
--
2.39.5
On 4/6/25 14:57, Jakub Wartak wrote:
On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
Hi Andres/Tomas,
I've noticed that Tomas responded to this while writing this, so I'm
attaching git-am patches based on his v25 (no squash) and there's only
one new (last one contains fixes based on this review) + slight commit
amendment to 0001.
I'm not working on this at the moment. I may have a bit of time in the
evening, but more likely I'll get back to this on Monday.
I just played around with this for a bit. As noted somewhere further down,
pg_buffercache_numa.page_num ends up wonky in different ways for the different
pages.
I think page_num is under heavy work in progress... I'm still not sure
whether it is worth exposing this (is it worth the hassle?). If we scratch it,
it won't be perfect, but we have everything; otherwise we risk this
feature, as we are going into a feature freeze literally tomorrow.
IMHO it's not difficult to change the definition of page_num this way,
it's pretty much a one line change. It's more a question of whether we
actually want to expose this.
[snip]
+
+    if (firstNumaTouch)
+        elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
Over the patch series the related code is duplicated. Seems like it'd be good
to put it into pg_numa instead? This seems like the thing that's good to
abstract away in one central spot.
I hear you, but we are using those statistics for per-localized-code
(shm.c's firstTouch != pgbuffercache.c's firstTouch).
Yeah, I don't think moving this is quite possible / useful. We could pass the
flag somewhere, but we still need to check & update it in local code.
+        /*
+         * If we have multiple OS pages per buffer, fill those in too. We
+         * always want at least one OS page, even if there are multiple
+         * buffers per page.
+         *
+         * Although we could query just once per each OS page, we do it
+         * repeatedly for each Buffer and hit the same address as
+         * move_pages(2) requires page alignment. This also simplifies
+         * retrieval code later on. Also NBuffers starts from 1.
+         */
+        for (j = 0; j < Max(1, pages_per_buffer); j++)
+        {
+            char       *buffptr = (char *) BufferGetBlock(i + 1);
+
+            fctx->record[idx].bufferid = bufferid;
+            fctx->record[idx].numa_page = j;
+
+            os_page_ptrs[idx]
+                = (char *) TYPEALIGN(os_page_size,
+                                     buffptr + (os_page_size * j));
FWIW, this bit here is the most expensive part of the function itself, as the
compiler has no choice than to implement it as an actual division, as
os_page_size is runtime variable.
It'd be fine to leave it like that, the call to numa_move_pages() is way more
expensive. But it shouldn't be too hard to do that alignment once, rather than
having to do it over and over.
TBH, I don't think we should spend a lot of time optimizing, after all
it's a debugging view (at the start I was actually considering putting
it as a developer-only compile time option, but with the shm view it is
actually usable for others too and well... we want to have it as a
foundation for real NUMA optimizations)
I agree with this.
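To illustrate the "do that alignment once" idea, here is a minimal standalone sketch; it is not code from the patch. The helper names align_down() and collect_page_addrs() are hypothetical, and the sketch assumes the OS page size is a power of two, so that rounding down is a mask operation. It aligns the start address once and then steps through the range in page-sized increments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: round addr down to a multiple of page_size.
 * Assumes page_size is a power of two.
 */
static uintptr_t
align_down(uintptr_t addr, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~((uintptr_t) page_size - 1);
}

/*
 * Hypothetical helper: fill out[] with the page-aligned addresses covering
 * the byte range [start, start + len). The alignment is done once for the
 * start address; every further address is obtained by adding page_size.
 * Returns the number of entries written (at most out_cap).
 */
static size_t
collect_page_addrs(uintptr_t start, size_t len, size_t page_size,
                   uintptr_t *out, size_t out_cap)
{
    uintptr_t   first = align_down(start, page_size);
    uintptr_t   end = start + len;
    size_t      n = 0;

    for (uintptr_t p = first; p < end && n < out_cap; p += page_size)
        out[n++] = p;

    return n;
}
```

The resulting array is the kind of page-pointer list that a batched move_pages(2) inquiry expects; whether this is worth doing in a debugging view is exactly the trade-off discussed above.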
FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
works quite right for me.
This definition afaict is always 0 when using huge pages and just 0 and 1 for
4k pages. But my understanding of numa_page is that it's the "id" of the numa
pages, which isn't that?
With "adjust page_num" I get a number that starts at -1 and then increments
from there. More correct, but doesn't quite seem right either.
Apparently handling this special case of split buffers
was a Pandora's box ;) Tomas, what do we do about it? Does that have a chance
to get in before the freeze?
I don't think the split buffers are a Pandora's box on their own, it's more
that it made us notice other issues / questions. I don't think handling
it is particularly complex - the most difficult part seems to be
figuring out how many rows we'll return, and mapping them to pages.
But that's not very difficult, IMO.
The bigger question is whether it's safe to do the TYPEALIGN_DOWN(),
which may return a pointer from before the first buffer. But I guess
that's OK, thanks to how memory is allocated - at least, that's what all
the move_pages() examples I found do, so unless those are all broken,
that's OK.
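The "how many rows we'll return, and mapping them to pages" part can be sketched as follows. This is an illustrative standalone example, not the patch's code: pages_spanned() and total_rows() are hypothetical names, addresses are plain integers, and the OS page size is assumed to be a power of two. A buffer touches every OS page between the page containing its first byte and the page containing its last byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of os_page_size (power of two assumed). */
static uintptr_t
page_align_down(uintptr_t addr, size_t os_page_size)
{
    assert(os_page_size != 0 && (os_page_size & (os_page_size - 1)) == 0);
    return addr & ~((uintptr_t) os_page_size - 1);
}

/* Number of OS pages touched by one buffer of blcksz bytes at buf_addr. */
static size_t
pages_spanned(uintptr_t buf_addr, size_t blcksz, size_t os_page_size)
{
    uintptr_t   first = page_align_down(buf_addr, os_page_size);
    uintptr_t   last = page_align_down(buf_addr + blcksz - 1, os_page_size);

    return (size_t) ((last - first) / os_page_size) + 1;
}

/* Total (buffer, OS page) rows for nbuffers contiguous buffers from base. */
static size_t
total_rows(uintptr_t base, size_t nbuffers, size_t blcksz, size_t os_page_size)
{
    size_t      rows = 0;

    for (size_t i = 0; i < nbuffers; i++)
        rows += pages_spanned(base + i * blcksz, blcksz, os_page_size);

    return rows;
}
```

With 8kB buffers on 4kB pages, an aligned pool gives two rows per buffer, while a pool whose first buffer does not start on a page boundary gives three rows per buffer. With 2MB huge pages most buffers map to a single page, and only a buffer straddling a page boundary yields two rows.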
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>bufferid</structfield> <type>integer</type>
+      </para>
+      <para>
+       ID, in the range 1..<varname>shared_buffers</varname>
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>page_num</structfield> <type>int</type>
+      </para>
+      <para>
+       number of OS memory page for this buffer
+      </para></entry>
+     </row>
"page_num" doesn't really seem very descriptive for "number of OS memory page
for this buffer". To me "number of" sounds like it's counting the number of
associated pages, but it's really just a "page id" or something like that.
Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
Tomas, are you good with the rename? I think I would also prefer os_page_id.
Fine with me.
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
[..]
+  <para>
+   The <structname>pg_shmem_allocations_numa</structname> shows how shared
+   memory allocations in the server's main shared memory segment are distributed
+   across NUMA nodes. This includes both memory allocated by
+   <productname>PostgreSQL</productname> itself and memory allocated
+   by extensions using the mechanisms detailed in
+   <xref linkend="xfunc-shared-addin" />.
+  </para>
I think it'd be good to describe that the view will include multiple rows for
each name if spread across multiple numa nodes.
Added.
Perhaps also that querying this view is expensive and that
pg_shmem_allocations should be used if numa information isn't needed?
Already covered by 1st finding fix.
+    /*
+     * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+     * the OS may have different memory page sizes.
+     *
+     * To correctly map between them, we need to: 1. Determine the OS memory
+     * page size 2. Calculate how many OS pages are used by all buffer blocks
+     * 3. Calculate how many OS pages are contained within each database
+     * block.
+     *
+     * This information is needed before calling move_pages() for NUMA memory
+     * node inquiry.
+     */
+    os_page_size = pg_numa_get_pagesize();
+
+    /*
+     * Allocate memory for page pointers and status based on total shared
+     * memory size. This simplified approach allocates enough space for all
+     * pages in shared memory rather than calculating the exact requirements
+     * for each segment.
+     *
+     * XXX Isn't this wasteful? But there probably is one large segment of
+     * shared memory, much larger than the rest anyway.
+     */
+    shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+    page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+    pages_status = palloc(sizeof(int) * shm_total_page_count);
There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
could more be moved to a shared helper function?
-> Tomas?
I'm not against that in principle, but when I tried it didn't quite help
that much. But maybe it's better with the current patch version.
+    hash_seq_init(&hstat, ShmemIndex);
+
+    /* output all allocated entries */
+    memset(nulls, 0, sizeof(nulls));
+    while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+    {
One thing that's interesting with using ShmemIndex is that this won't get
anonymous allocations. I think that's ok for now, it'd be complicated to
figure out the unmapped "regions". But I guess it'd be good to mention it in
the docs?
Added.
+        int         i;
+
+        /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
+         * It's a bit misleading to call that "aligned", no? */
+
+        /* Get number of OS aligned pages */
+        shm_ent_page_count
+            = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+        /*
+         * If we get ever 0xff back from kernel inquiry, then we probably have
+         * bug in our buffers to OS page mapping code here.
+         */
+        memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
There's obviously no guarantee that shm_ent_page_count is a multiple of
os_page_size. I think it'd be interesting to show in the view when one shmem
allocation shares a page with the prior allocation - that can contribute a bit
to contention. What about showing a start_os_page_id and end_os_page_id or
something? That could be a feature for later though.
I was thinking about it, but it could be done when analyzing this
together with data from pg_shmem_allocations(?) My worry is timing :(
Anyway, we could extend this view in future revisions.
I'd leave this out for now. It's not difficult, but let's focus on the
other issues.
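For reference, the start_os_page_id / end_os_page_id idea could look roughly like the sketch below. This is a hypothetical illustration only and nothing like it is part of the patch set. The OsPageRange struct and both function names are made up; offsets are assumed to be relative to a page-aligned start of the shared memory segment, and allocations are assumed to be non-empty and visited in ascending offset order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: first and last OS page occupied by one shmem allocation. */
typedef struct OsPageRange
{
    size_t      start_os_page_id;
    size_t      end_os_page_id;
} OsPageRange;

/* Compute the OS page range for an allocation at [offset, offset + size). */
static OsPageRange
alloc_page_range(size_t offset, size_t size, size_t os_page_size)
{
    OsPageRange r;

    assert(size > 0 && os_page_size > 0);
    r.start_os_page_id = offset / os_page_size;
    r.end_os_page_id = (offset + size - 1) / os_page_size;
    return r;
}

/* True if cur begins on the same OS page on which prior ends. */
static int
shares_page_with_prior(OsPageRange prior, OsPageRange cur)
{
    return cur.start_os_page_id == prior.end_os_page_id;
}
```

An allocation whose start page equals the previous allocation's end page is one that shares an OS page with its neighbour, which is the contention case mentioned above.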
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
Could it be worthwhile to run the test if !pg_numa_available(), to test that
we do the right thing in that case? We need an alternative output anyway, so
that might be fine?
Added. The meson test passes, but I'm sending it as fast as possible
to avoid a clash with Tomas.
Please keep working on this. I may have a bit of time in the evening,
but in the worst case I'll merge it into your patch.
regards
--
Tomas Vondra
On Sun, Apr 6, 2025 at 3:52 PM Tomas Vondra <tomas@vondra.me> wrote:
On 4/6/25 14:57, Jakub Wartak wrote:
On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
Hi Andres/Tomas,
I've noticed that Tomas responded to this while writing this, so I'm
attaching git-am patches based on his v25 (no squash) and there's only
one new (last one contains fixes based on this review) + slight commit
amendment to 0001.
I'm not working on this at the moment. I may have a bit of time in the
evening, but more likely I'll get back to this on Monday.
OK, tried to fix all outstanding issues (except maybe some tiny code
refactors for beautification)
I just played around with this for a bit. As noted somewhere further down,
pg_buffercache_numa.page_num ends up wonky in different ways for the different
pages.
I think page_num is under heavy work in progress... I'm still not sure
whether it is worth exposing this (is it worth the hassle?). If we scratch it,
it won't be perfect, but we have everything; otherwise we risk this
feature, as we are going into a feature freeze literally tomorrow.
IMHO it's not difficult to change the definition of page_num this way,
it's pretty much a one line change. It's more a question of whether we
actually want to expose this.
Bertrand noticed this first in
/messages/by-id/Z/FhOOCmTxuB2h0b@ip-10-97-1-34.eu-west-3.compute.internal
:
- startptr = (char *) BufferGetBlock(1);
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size,
+                                    (char *) BufferGetBlock(1));
With the above I'm also not getting wonky (-1) results anymore. The
rest of this reply assumes we are using this.
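A small standalone sketch of why aligning startptr matters; this is not the patch code, and the function names are hypothetical. If the first buffer does not begin on an OS page boundary, the page containing it lies before startptr, so the byte offset of that page is negative. That is consistent with the first page id coming out wrong, as with the -1 reported above. Aligning the start pointer down as well makes the ids start at 0. Power-of-two page sizes are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of os_page_size (power of two assumed). */
static uintptr_t
page_align_down(uintptr_t addr, size_t os_page_size)
{
    assert(os_page_size != 0 && (os_page_size & (os_page_size - 1)) == 0);
    return addr & ~((uintptr_t) os_page_size - 1);
}

/*
 * Signed byte distance from an unaligned startptr to the OS page that
 * contains addr. Negative when that page begins before startptr.
 */
static intptr_t
page_offset_bytes(uintptr_t startptr, uintptr_t addr, size_t os_page_size)
{
    return (intptr_t) (page_align_down(addr, os_page_size) - startptr);
}

/* OS page id relative to the page-aligned start of the buffer pool. */
static size_t
os_page_id(uintptr_t first_buffer, uintptr_t addr, size_t os_page_size)
{
    uintptr_t   start = page_align_down(first_buffer, os_page_size);

    return (size_t) ((page_align_down(addr, os_page_size) - start)
                     / os_page_size);
}
```

For a first buffer at 0x10800 with 4kB pages, the unaligned offset of the first page is -0x800 bytes, while the aligned variant yields page ids 0, 1, 2 for consecutive pages.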
[snip]
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
Over the patch series the related code is duplicated. Seems like it'd be good
to put it into pg_numa instead? This seems like the thing that's good to
abstract away in one central spot.
I hear you, but we are using those statistics for per-localized-code
(shm.c's firstTouch != pgbuffercache.c's firstTouch).
Yeah, I don't think moving this is quite possible / useful. We could pass the
flag somewhere, but we still need to check & update it in local code.
No idea what such code should look like, but wouldn't it be more code?
Are you thinking about some code still touching the local &firstNumaTouch?
FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
works quite right for me.
This definition afaict is always 0 when using huge pages and just 0 and 1 for
4k pages. But my understanding of numa_page is that it's the "id" of the numa
pages, which isn't that?
With "adjust page_num" I get a number that starts at -1 and then increments
from there. More correct, but doesn't quite seem right either.
Apparently handling this special case of split buffers
was a Pandora's box ;) Tomas, what do we do about it? Does that have a chance
to get in before freeze?
I don't think the split buffers are a Pandora's box on their own, it's more
that it made us notice other issues / questions. I don't think handling
it is particularly complex - the most difficult part seems to be
figuring out how many rows we'll return, and mapping them to pages.
But that's not very difficult, IMO.
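As a rough illustration of the "how many rows per buffer" question, the number of OS pages a single buffer touches follows from its start address, its length and the OS page size. os_pages_spanned() below is a hypothetical helper, not code from the patch, and it assumes plain byte addresses with no wraparound.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of OS pages touched by a buffer of buf_len bytes that starts at
 * address buf_start.
 *
 * Hypothetical helper for illustration only; it is not part of the patch.
 */
static uint64_t
os_pages_spanned(uint64_t buf_start, uint64_t buf_len, uint64_t page_size)
{
    uint64_t first_page;
    uint64_t last_page;

    assert(buf_len > 0 && page_size > 0);

    /* Page index of the first and of the last byte of the buffer. */
    first_page = buf_start / page_size;
    last_page = (buf_start + buf_len - 1) / page_size;

    return last_page - first_page + 1;
}
```

Under these assumptions an 8kB buffer aligned to 4kB pages spans 2 pages, the same buffer starting half-way through a 4kB page spans 3, and with 2MB huge pages it spans 1 page unless it straddles a huge page boundary, in which case it spans 2.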
OK, with a fresh week, and fresh mind and a different name (ospageid)
it looks better to me now.
The bigger question is whether it's safe to do the TYPEALIGN_DOWN(),
which may return a pointer from before the first buffer. But I guess
that's OK, thanks to how memory is allocated - at least, that's what all
the move_pages() examples I found do, so unless those are all broken,
that's OK.
I agree. This took some more time, but in the case of
a) the pg_buffercache_numa view with HP=off, we shouldn't access a pointer
below the buffer cache, because my understanding is that the shm memory would
be page aligned anyway, as BufferManagerShmemInit() uses
TYPEALIGN(PG_IO_ALIGN_SIZE) for it. So when you query
pg_buffercache_numa, one gets the following pages:
# strace -fp 14364 -e move_pages
strace: Process 14364 attached
move_pages(0, 32768, [0x7f8bfe4b5000, 0x7f8bfe4b6000, 0x7f8bfe4b7000,...
(gdb) print BufferBlocks
$1 = 0x7f8bfe4b5000 ""
and BufferBlocks actually starts there, so we are good without HP.
With HP it's move_pages(0, 16384, [0x7ff33c800000, ... vs
(gdb) print BufferBlocks
$1 = 0x7ff33c879000
so we are actually accessing an earlier pointer (!), but Buffer Blocks is
about the 15th structure there (and to my understanding there's always going
to be something earlier):
select row_number() over (order by off),* from pg_shmem_allocations
order by off asc limit 15;
 row_number |          name          |   off   |   size    | allocated_size
------------+------------------------+---------+-----------+----------------
[..]
         13 | Shared MultiXact State | 5735936 |      1148 |           1152
         14 | Buffer Descriptors     | 5737088 |   1048576 |        1048576
         15 | Buffer Blocks          | 6785664 | 134221824 |      134221824
To me this is a finding in itself: with HP, shared_buffers ends up located
on a page together with something else...
b) in the pg_shmem_allocations_numa view without HP we don't even get
as low as ShmemBase for move_pages(), but with HP=on we go as low as
ShmemBase but not lower, so we are good IMHO.
As an insurance policy, wouldn't the buildfarm actually tell us if that's bad?
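The observation in (a), that Buffer Blocks shares its first huge page with an earlier allocation, can be checked with simple arithmetic on the "off" column shown above. The sketch below uses a hypothetical locate_in_page() helper and assumes that the offsets are relative to a segment base that is itself huge-page aligned, which the output above does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Position of a shared memory offset in terms of OS pages. */
typedef struct PageSlot
{
    uint64_t page_no;   /* index of the OS page within the segment */
    uint64_t offset;    /* byte offset inside that page */
} PageSlot;

/*
 * Split an offset within the shared memory segment into an OS page index
 * and an offset within that page.
 *
 * Hypothetical helper for illustration only; assumes the segment base is
 * aligned to page_size.
 */
static PageSlot
locate_in_page(uint64_t shmem_off, uint64_t page_size)
{
    PageSlot slot;

    assert(page_size > 0);

    slot.page_no = shmem_off / page_size;
    slot.offset = shmem_off % page_size;

    return slot;
}
```

Under that assumption, with 2MB huge pages Buffer Blocks (off 6785664) starts 494208 bytes into page 3, while Buffer Descriptors (off 5737088, size 1048576) starts in page 2 and ends in page 3, so the two allocations share a huge page.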
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>page_num</structfield> <type>int</type>
+ </para>
+ <para>
+ number of OS memory page for this buffer
+ </para></entry>
+ </row>
"page_num" doesn't really seem very descriptive for "number of OS memory page
for this buffer". To me "number of" sounds like it's counting the number of
associated pages, but it's really just a "page id" or something like that.
Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
Tomas, are you good with the rename? I think I would also prefer os_page_id.
Fine with me.
Done, s/page_num/ospageid/g, as pg_buffercache as a whole does not use "_"
anywhere, so let's stick to that. Did the same for nodeid.
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * XXX Isn't this wasteful? But there probably is one large segment of
+ * shared memory, much larger than the rest anyway.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
could more be moved to a shared helper function?
-> Tomas?
I'm not against that in principle, but when I tried it didn't quite help
that much. But maybe it's better with the current patch version.
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
One thing that's interesting with using ShmemIndex is that this won't get
anonymous allocations. I think that's ok for now, it'd be complicated to
figure out the unmapped "regions". But I guess it'd be good to mention it in
the docs?
Added.
BTW: it's also in the function comment now.
[..], but I'm sending it as fast as possible
to avoid a clash with Tomas.
Please keep working on this. I may have a bit of time in the evening,
but in the worst case I'll merge it into your patch.
So, attached is still patchset v25, still with one-off patches (please
git am most of them, but just `patch -p1 < file` the last two, and squash them
if you are happy; I've intentionally not squashed them, to provide a
changelog). This LGTM.
-J.
Attachments:
v25-0005-adjust-page-alignment.patch
From bc2a6e9d279c38afa1ccfd60ec93ad88eaf80b36 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 5 Apr 2025 16:20:13 +0200
Subject: [PATCH v25 5/7] adjust page alignment
---
src/backend/storage/ipc/shmem.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 5d979423bd9..4a9a9606f2e 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -637,13 +637,22 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
{
int i;
+ char *startptr,
+ *endptr;
+ Size total_len;
- /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
- * It's a bit misleading to call that "aligned", no? */
+ /*
+ * Calculate the range of OS pages used by this segment. The segment
+ * may start / end half-way through a page, we want to count these
+ * pages too. So we align the start/end pointers down/up, and then
+ * calculate the number of pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, ent->location);
+ endptr = (char *) TYPEALIGN(os_page_size,
+ (char *) ent->location + ent->allocated_size);
+ total_len = (endptr - startptr);
- /* Get number of OS aligned pages */
- shm_ent_page_count
- = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+ shm_ent_page_count = total_len / os_page_size;
/*
* If we get ever 0xff back from kernel inquiry, then we probably have
@@ -663,7 +672,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
{
volatile uint64 touch pg_attribute_unused();
- page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ page_ptrs[i] = startptr + (i * os_page_size);
if (firstNumaTouch)
pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
--
2.39.5
v25-0004-Introduce-pg_shmem_allocations_numa-view.patch
From 1cabc553ea4a7e185fd83f1ab081521820fe6229 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v25 4/7] Introduce pg_shmem_allocations_numa view
Introduce new pg_shmem_allocations_numa view with information about how
shared memory is distributed across NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 152 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 ++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 294 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..a83365ae24a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..5d979423bd9 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,152 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * XXX Isn't this wasteful? But there probably is one large segment of
+ * shared memory, much larger than the rest anyway.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
+ * It's a bit misleading to call that "aligned", no? */
+
+ /* Get number of OS aligned pages */
+ shm_ent_page_count
+ = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ /*
+ * Setup page_ptrs[] with pointers to all OS pages for this segment,
+ * and get the NUMA status using pg_numa_query_pages.
+ *
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (ENOENT, which indicates unmapped/unallocated pages).
+ */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ /* Count number of NUMA nodes used for this shared memory entry */
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s < 0 || s > max_nodes)
+ {
+ elog(ERROR, "invalid NUMA node id outside of allowed range "
+ "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ nodes[s]++;
+ }
+
+ /*
+ * Add one entry for each NUMA node, including those without allocated
+ * memory for this segment.
+ */
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = i;
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ *
+ * XXX Not quite sure why this is at the end, and what "output memory"
+ * refers to.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index dfc59ea0cc8..a93075c675c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8546,6 +8546,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..668172f7d79
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 1fddb13b6ae..c25062c288f 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3219,8 +3219,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3242,6 +3242,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3261,6 +3267,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..abfdc97abc5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..034098783fb
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 85d7280f35f..f337aa67c13 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1947,8 +1947,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1958,12 +1958,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
v25-0001-Add-support-for-basic-NUMA-awareness.patch
From 40fdd84b2e03c1121c87ee64403d14b75e2a7379 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v25 1/7] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 442 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 8f4a5ab28ec..0936010718d 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index fc5f7475d07..2a78cddd825 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to 1 to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0224f93733d..9ab070adffb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..8ebf0b03ec0 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma for basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma for basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 27717ad8976..b562a00c588 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: libnumaopt)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3279,6 +3300,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3935,6 +3957,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5d5be8ba4e1..dfc59ea0cc8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8542,6 +8542,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9891b9b05c3..1af0b6316dd 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -689,6 +689,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..3c1b50c1428
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(volatile uint64 *) (ptr)
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 48d2dfb7cf3..fc7b059fee5 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux, via libnuma, but in the
+ * future support for other platforms (e.g. Win32 or FreeBSD) might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use the move_pages(2) syscall here, instead of get_mempolicy(2), as the
+ * former allows us to query many memory pages in a single batched system
+ * call, which is much faster than issuing one call per page.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.39.5
v25-0002-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
From 98435a1e46768784f22aa0929a83951ac0a5a965 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 20:21:25 +0200
Subject: [PATCH v25 2/7] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
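The relationship between the two sizes can be illustrated with a small sketch in C. It is not code from the patch: the helper names numa_query_count() and page_start() are hypothetical, and it only assumes that both sizes are powers of two, as the patch itself expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): number of page addresses to
 * query for nbuffers buffers of blcksz bytes when OS pages are page_size
 * bytes. At least one entry per buffer is needed, even if several buffers
 * share one (larger) memory page.
 */
static uint64_t
numa_query_count(uint64_t nbuffers, size_t blcksz, size_t page_size)
{
	size_t		pages_per_buffer = blcksz / page_size;	/* 0 if page is larger */

	return nbuffers * (pages_per_buffer > 0 ? pages_per_buffer : 1);
}

/*
 * Hypothetical helper: round an address down to the start of the OS page
 * containing it. page_size must be a power of two.
 */
static uintptr_t
page_start(uintptr_t addr, size_t page_size)
{
	return addr & ~((uintptr_t) page_size - 1);
}
```

With 16384 buffers of 8KB and 4KB pages, the first helper yields 32768 entries (two per buffer). With 2MB huge pages it yields 16384 entries, many of which point into the same memory page.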
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 5 +-
.../expected/pg_buffercache_numa.out | 28 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 22 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 288 ++++++++++++++++++
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 75 ++++-
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 443 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..1230e244a5f
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, page_num int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..0b96476c319 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -20,6 +21,8 @@
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
+#define NUM_BUFFERCACHE_NUMA_ELEM 3
+
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
.version = PG_VERSION
@@ -58,16 +61,44 @@ typedef struct
BufferCachePagesRec *record;
} BufferCachePagesContext;
+/*
+ * Record structure holding the NUMA data to be exposed for each buffer.
+ */
+typedef struct
+{
+ uint32 bufferid;
+ int32 numa_page;
+ int32 numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+ TupleDesc tupdesc;
+ int buffers_per_page;
+ int pages_per_buffer;
+ int os_page_size;
+ BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
/*
* Function returning data from the shared buffer cache - buffer number,
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -246,6 +277,263 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * Inquire about NUMA memory mappings for shared buffers.
+ *
+ * Returns NUMA node ID for each memory page used by the buffer. Buffers may
+ * be smaller or larger than OS memory pages. For each buffer we return one
+ * entry for each memory page used by the buffer (if the buffer is smaller,
+ * it only uses a part of one memory page).
+ *
+ * We expect both sizes (for buffers and memory pages) to be a power-of-2, so
+ * one is always a multiple of the other.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ BufferCacheNumaContext *fctx; /* User function context. */
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ j,
+ idx;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_page_status;
+ uint64 os_page_count;
+ int pages_per_buffer;
+ int buffers_per_page;
+ volatile uint64 touch pg_attribute_unused();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ buffers_per_page = os_page_size / BLCKSZ;
+ pages_per_buffer = BLCKSZ / os_page_size;
+
+ /*
+ * The page and block sizes are both expected to be 2^k, so one divides
+ * the other (we don't know in which direction).
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * Either both counts are 1 (when the pages have the same size), or
+ * exactly one of them is zero. Both can't be zero at the same time.
+ */
+ Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
+ Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
+ ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+
+ /*
+ * How many addresses we are going to query (store) depends on the
+ * relation between BLCKSZ : PAGESIZE. We need at least one status per
+ * buffer - if the memory page is larger than buffer, we still query
+ * it for each buffer. With multiple memory pages per buffer, we need
+ * that many entries.
+ */
+ os_page_count = NBuffers * Max(1, pages_per_buffer);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " "
+ "os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_count, os_page_size,
+ buffers_per_page, pages_per_buffer);
+
+
+ /* Initialize the multi-call context, load entries about buffers */
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+ /*
+ * Get the result type so that we can validate the column count below.
+ * We can't use the result type determined by the function definition
+ * without potentially crashing when somebody uses an old (or even
+ * wrong) function definition. Unlike pg_buffercache_pages(), there is
+ * no optional pinning_backends column to handle here, because this
+ * function first appears in version 1.6 of the extension.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "page_num",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "node_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate os_page_count worth of BufferCacheNumaRec records. */
+ fctx->record = (BufferCacheNumaRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCacheNumaRec) * os_page_count);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+
+ /* Used to determine the NUMA node for all OS pages at once */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffer to OS page mapping code here.
+ */
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask for
+ * all memory pages for NBuffers.
+ */
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ /*
+ * If we have multiple OS pages per buffer, fill those in too. We
+ * always want at least one OS page, even if there are multiple
+ * buffers per page.
+ *
+ * Although we could query just once per OS page, we do it
+ * repeatedly for each buffer and hit the same address, as
+ * move_pages(2) requires page alignment. This also simplifies the
+ * retrieval code later on. Also note that Buffer IDs start from 1.
+ */
+ for (j = 0; j < Max(1, pages_per_buffer); j++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].numa_page = j;
+
+ os_page_ptrs[idx]
+ = (char *) TYPEALIGN_DOWN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+
+ ++idx;
+ }
+
+ }
+
+ /* We should get exactly the expected number of entries */
+ Assert(idx == os_page_count);
+
+ /* Query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /*
+ * Update the entries with NUMA node ID. The status array is indexed
+ * the same way as the entry index.
+ */
+ for (i = 0; i < os_page_count; i++)
+ {
+ fctx->record[i].numa_node = os_page_status[i];
+ }
+
+ /* Remember this backend touched the pages */
+ firstNumaTouch = false;
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+ Datum values[NUM_BUFFERCACHE_NUMA_ELEM];
+ bool nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(fctx->record[i].numa_page);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(fctx->record[i].numa_node);
+ nulls[2] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..b01f8e71357 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,15 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+ information is not part of <function>pg_buffercache_pages()</function>
+ itself, as it is much slower to retrieve.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +211,68 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>page_num</structfield> <type>int</type>
+ </para>
+ <para>
+ number of OS memory page for this buffer
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all the cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0c81d03950d..ed74a76a5c7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -341,6 +341,8 @@ BufFile
Buffer
BufferAccessStrategy
BufferAccessStrategyType
+BufferCacheNumaRec
+BufferCacheNumaContext
BufferCachePagesContext
BufferCachePagesRec
BufferDesc
--
2.39.5
v25-0003-adjust-page_num.patch (application/octet-stream)
From c3c030c3fb3a2c39a164a7ab1b7bea9df5f5a9b7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 5 Apr 2025 16:00:39 +0200
Subject: [PATCH v25 3/7] adjust page_num
---
contrib/pg_buffercache/pg_buffercache_pages.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 0b96476c319..a3c4a2578d9 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -315,6 +315,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
int pages_per_buffer;
int buffers_per_page;
volatile uint64 touch pg_attribute_unused();
+ char *startptr = NULL;
if (pg_numa_init() == -1)
elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
@@ -437,6 +438,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
* to one big big move_pages(2) inquiry system call. Basically we ask
* for all memory pages for NBuffers.
*/
+ startptr = (char *) BufferGetBlock(1);
idx = 0;
for (i = 0; i < NBuffers; i++)
{
@@ -469,11 +471,14 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
char *buffptr = (char *) BufferGetBlock(i + 1);
fctx->record[idx].bufferid = bufferid;
- fctx->record[idx].numa_page = j;
os_page_ptrs[idx]
- = (char *) TYPEALIGN(os_page_size,
- buffptr + (os_page_size * j));
+ = (char *) TYPEALIGN_DOWN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* calculate ID of the OS memory page */
+ fctx->record[idx].numa_page
+ = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;
/* Only need to touch memory once per backend process lifetime */
if (firstNumaTouch)
--
2.39.5
v25-0006-fixes-for-review-by-Andres.patch (application/octet-stream)
From 7b279721ae04e823f20e94331dd3b0a634ff3e7f Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Sun, 6 Apr 2025 14:19:41 +0200
Subject: [PATCH v25 6/7] fixes for review by Andres
---
contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 8 --------
doc/src/sgml/system-views.sgml | 8 +++++++-
meson.build | 2 +-
src/include/port/pg_numa.h | 2 +-
src/test/regress/expected/numa.out | 1 +
src/test/regress/expected/numa_1.out | 2 ++
src/test/regress/sql/numa.sql | 1 +
8 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
index 1230e244a5f..e3b145a1687 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -10,7 +10,7 @@ AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
LANGUAGE C PARALLEL SAFE;
-- Create a view for convenient access.
-CREATE OR REPLACE VIEW pg_buffercache_numa AS
+CREATE VIEW pg_buffercache_numa AS
SELECT P.* FROM pg_buffercache_numa_pages() AS P
(bufferid integer, page_num int4, node_id int4);
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index a3c4a2578d9..df94cc6ef7d 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -375,14 +375,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/* Create a user function context for cross-call persistence */
fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
- /*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
- */
if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index a83365ae24a..4e853885de6 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -4069,7 +4069,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
across NUMA nodes. This includes both memory allocated by
<productname>PostgreSQL</productname> itself and memory allocated
by extensions using the mechanisms detailed in
- <xref linkend="xfunc-shared-addin" />.
+ <xref linkend="xfunc-shared-addin" />. This view will output multiple rows
+ for each of the shared memory segments provided that they are spread across
+ multiple NUMA nodes. This view should not be queried by monitoring systems
+ as it is very slow and may end up allocating shared memory in case it was not
+ used earlier.
+ The current limitation of this view is that it won't show anonymous shared memory
+ allocations.
</para>
<para>
diff --git a/meson.build b/meson.build
index b562a00c588..a1516e54529 100644
--- a/meson.build
+++ b/meson.build
@@ -950,7 +950,7 @@ endif
libnumaopt = get_option('libnuma')
if not libnumaopt.disabled()
# via pkg-config
- libnuma = dependency('numa', required: libnumaopt)
+ libnuma = dependency('numa', required: false)
if not libnuma.found()
libnuma = cc.find_library('numa', required: libnumaopt)
endif
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 3c1b50c1428..7e990d9f776 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -28,7 +28,7 @@ extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
* need to page-fault before move_pages(2) syscall returns valid results.
*/
#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
- ro_volatile_var = *(uint64 *) ptr
+ ro_volatile_var = *(volatile uint64 *) ptr
#else
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
index 668172f7d79..8af5dfeb9a5 100644
--- a/src/test/regress/expected/numa.out
+++ b/src/test/regress/expected/numa.out
@@ -1,5 +1,6 @@
SELECT NOT(pg_numa_available()) AS skip_test \gset
\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
\quit
\endif
-- switch to superuser
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
index 6dd6824b4e4..c90042fa7cc 100644
--- a/src/test/regress/expected/numa_1.out
+++ b/src/test/regress/expected/numa_1.out
@@ -1,3 +1,5 @@
SELECT NOT(pg_numa_available()) AS skip_test \gset
\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+ERROR: libnuma initialization failed or NUMA is not supported on this platform
\quit
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
index 034098783fb..324481c33b7 100644
--- a/src/test/regress/sql/numa.sql
+++ b/src/test/regress/sql/numa.sql
@@ -1,5 +1,6 @@
SELECT NOT(pg_numa_available()) AS skip_test \gset
\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
\quit
\endif
--
2.39.5
v25-0007-fix-remaining-outstanding-issues-from-Sunday.patch (application/octet-stream)
From 094207663855318891ab23ca927e84ddd37f23b7 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Mon, 7 Apr 2025 10:06:38 +0200
Subject: [PATCH v25 7/7] fix remaining outstanding issues from Sunday
---
contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 6 +++---
doc/src/sgml/pgbuffercache.sgml | 4 ++--
src/backend/storage/ipc/shmem.c | 8 +++++++-
4 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
index e3b145a1687..998289790b7 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -12,7 +12,7 @@ LANGUAGE C PARALLEL SAFE;
-- Create a view for convenient access.
CREATE VIEW pg_buffercache_numa AS
SELECT P.* FROM pg_buffercache_numa_pages() AS P
- (bufferid integer, page_num int4, node_id int4);
+ (bufferid integer, ospageid int4, nodeid int4);
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index df94cc6ef7d..bf96d4cea89 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -385,9 +385,9 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "page_num",
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "ospageid",
INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "node_id",
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "nodeid",
INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -430,7 +430,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
* to one big big move_pages(2) inquiry system call. Basically we ask
* for all memory pages for NBuffers.
*/
- startptr = (char *) BufferGetBlock(1);
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));
idx = 0;
for (i = 0; i < NBuffers; i++)
{
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index b01f8e71357..b39c9849362 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -244,7 +244,7 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>page_num</structfield> <type>int</type>
+ <structfield>ospageid</structfield> <type>int</type>
</para>
<para>
number of OS memory page for this buffer
@@ -253,7 +253,7 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>node_id</structfield> <type>int</type>
+ <structfield>nodeid</structfield> <type>int</type>
</para>
<para>
ID of <acronym>NUMA</acronym> node
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 4a9a9606f2e..75cce5ca8a5 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -572,7 +572,13 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
-/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+/*
+ * SQL SRF showing NUMA memory nodes for allocated shared memory
+ *
+ * Unlike pg_get_shmem_allocations() above, this function does not output
+ * information about shared anonymous allocations and
+ * unused memory.
+ */
Datum
pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
{
--
2.39.5
Hi,
On Mon, Apr 07, 2025 at 10:09:26AM +0200, Jakub Wartak wrote:
Bertrand noticed this first in
/messages/by-id/Z/FhOOCmTxuB2h0b@ip-10-97-1-34.eu-west-3.compute.internal
:
- startptr = (char *) BufferGetBlock(1);
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));
With the above I'm also not getting wonky (-1) results anymore. The
rest of reply assumes we are using this.
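To illustrate the arithmetic behind that change, here is a minimal standalone sketch, not the patch code. The helper names (align_down, page_offset_bytes, os_page_id) and the example addresses are made up for illustration; align_down only mimics what TYPEALIGN_DOWN does for power-of-two sizes. It shows that when the first buffer does not start on an OS page boundary (for example with 2MB huge pages), the aligned-down page address lies below an unaligned startptr, so the byte offset becomes negative, while an aligned startptr yields page IDs starting at 0.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of align; align must be a power of two. */
static uint64_t
align_down(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/*
 * Signed byte distance from startptr to the start of the OS page that
 * holds the j-th page-sized chunk of the buffer located at buffptr.
 */
static int64_t
page_offset_bytes(uint64_t startptr, uint64_t buffptr,
				  uint64_t j, uint64_t os_page_size)
{
	uint64_t	page = align_down(buffptr + os_page_size * j, os_page_size);

	return (int64_t) page - (int64_t) startptr;
}

/* OS page ID relative to startptr, computed with signed arithmetic. */
static int64_t
os_page_id(uint64_t startptr, uint64_t buffptr,
		   uint64_t j, uint64_t os_page_size)
{
	return page_offset_bytes(startptr, buffptr, j, os_page_size)
		/ (int64_t) os_page_size;
}
```

With a hypothetical first buffer at 0x10001000 and 2MB pages, page_offset_bytes(0x10001000, 0x10001000, 0, 0x200000) is negative, whereas passing align_down(0x10001000, 0x200000) as startptr gives page ID 0 for the same buffer.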
yeah, I can see that you added it in v25-0007. In the same message I mentioned
to "use the actual buffer address when pg_numa_touch_mem_if_required()
is called?"
So, to be extra cautious we could do something like:
@@ -474,7 +474,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/* Only need to touch memory once per backend process lifetime */
if (firstNumaTouch)
- pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+ pg_numa_touch_mem_if_required(touch, buffptr + (j * os_page_size));
what do you think?
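A small standalone sketch of the distinction being proposed, under stated assumptions: the helper names (query_addr, touch_addr, same_os_page) and addresses are hypothetical and not part of the patch. The address handed to the NUMA inquiry is the aligned-down start of the OS page, which may lie below the buffer itself when the OS page is larger than the buffer alignment; the address that is read to fault the page in stays inside the buffer. Both refer to the same OS page, so touching the buffer address should be sufficient for the inquiry.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of align; align must be a power of two. */
static uint64_t
align_down(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/* Address passed to the NUMA inquiry: start of the OS page for chunk j. */
static uint64_t
query_addr(uint64_t buffptr, uint64_t j, uint64_t os_page_size)
{
	return align_down(buffptr + j * os_page_size, os_page_size);
}

/* Address read to page-fault the memory: always within the buffer area. */
static uint64_t
touch_addr(uint64_t buffptr, uint64_t j, uint64_t os_page_size)
{
	return buffptr + j * os_page_size;
}

/* Returns 1 when both addresses fall into the same OS page, else 0. */
static int
same_os_page(uint64_t a, uint64_t b, uint64_t os_page_size)
{
	return align_down(a, os_page_size) == align_down(b, os_page_size);
}
```

For a buffer at 0x10001000 with 2MB pages, query_addr() returns 0x10000000 (below the buffer) while touch_addr() returns 0x10001000; with 4kB pages and a page-aligned buffer the two addresses coincide.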
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Apr 7, 2025 at 11:53 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Mon, Apr 07, 2025 at 10:09:26AM +0200, Jakub Wartak wrote:
Bertrand noticed this first in
/messages/by-id/Z/FhOOCmTxuB2h0b@ip-10-97-1-34.eu-west-3.compute.internal
:
- startptr = (char *) BufferGetBlock(1);
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));
With the above I'm also not getting wonky (-1) results anymore. The
rest of reply assumes we are using this.
yeah, I can see that you added it in v25-0007. In the same message I mentioned
to "use the actual buffer address when pg_numa_touch_mem_if_required()
is called?"
So, to be extra cautious we could do something like:
@@ -474,7 +474,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/* Only need to touch memory once per backend process lifetime */
if (firstNumaTouch)
- pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+ pg_numa_touch_mem_if_required(touch, buffptr + (j * os_page_size));
what do you think?
Yeah, I think we could include this too, as it looks safer (sorry, I
missed that one). Attached is v25 as it was, with this little tweak.
-J.
Attachments:
v25-0003-Introduce-pg_shmem_allocations_numa-view.patch (application/octet-stream)
From 1cabc553ea4a7e185fd83f1ab081521820fe6229 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v25 3/6] Introduce pg_shmem_allocations_numa view
Introduce new pg_shmem_allocations_numa view with information about how
shared memory is distributed across NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 79 ++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 152 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 12 ++
src/test/regress/expected/numa_1.out | 3 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 9 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 294 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..a83365ae24a 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..5d979423bd9 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,152 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * XXX Isn't this wasteful? But there probably is one large segment of
+ * shared memory, much larger than the rest anyway.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+
+ /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
+ * It's a bit misleading to call that "aligned", no? */
+
+ /* Get number of OS aligned pages */
+ shm_ent_page_count
+ = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ /*
+ * Setup page_ptrs[] with pointers to all OS pages for this segment,
+ * and get the NUMA status using pg_numa_query_pages.
+ *
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (ENOENT, which indicates unmapped/unallocated pages).
+ */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ /* Count number of NUMA nodes used for this shared memory entry */
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s < 0 || s > max_nodes)
+ {
+ elog(ERROR, "invalid NUMA node id outside of allowed range "
+ "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ nodes[s]++;
+ }
+
+ /*
+ * Add one entry for each NUMA node, including those without allocated
+ * memory for this segment.
+ */
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = i;
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ *
+ * XXX Not quite sure why this is at the end, and what "output memory"
+ * refers to.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index dfc59ea0cc8..a93075c675c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8546,6 +8546,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..668172f7d79
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 1fddb13b6ae..c25062c288f 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3219,8 +3219,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3242,6 +3242,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3261,6 +3267,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..abfdc97abc5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..034098783fb
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 85d7280f35f..f337aa67c13 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1947,8 +1947,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1958,12 +1958,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.39.5
v25-0004-adjust-page-alignment.patch
From bc2a6e9d279c38afa1ccfd60ec93ad88eaf80b36 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 5 Apr 2025 16:20:13 +0200
Subject: [PATCH v25 4/6] adjust page alignment
---
src/backend/storage/ipc/shmem.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 5d979423bd9..4a9a9606f2e 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -637,13 +637,22 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
{
int i;
+ char *startptr,
+ *endptr;
+ Size total_len;
- /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
- * It's a bit misleading to call that "aligned", no? */
+ /*
+ * Calculate the range of OS pages used by this segment. The segment
+ * may start / end half-way through a page, we want to count these
+ * pages too. So we align the start/end pointers down/up, and then
+ * calculate the number of pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, ent->location);
+ endptr = (char *) TYPEALIGN(os_page_size,
+ (char *) ent->location + ent->allocated_size);
+ total_len = (endptr - startptr);
- /* Get number of OS aligned pages */
- shm_ent_page_count
- = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+ shm_ent_page_count = total_len / os_page_size;
/*
* If we get ever 0xff back from kernel inquiry, then we probably have
@@ -663,7 +672,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
{
volatile uint64 touch pg_attribute_unused();
- page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+ page_ptrs[i] = startptr + (i * os_page_size);
if (firstNumaTouch)
pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
--
2.39.5
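
To illustrate the page-range calculation that the 0004 patch switches to, here is a minimal standalone sketch. The helper names (align_down, align_up, pages_spanned) are hypothetical stand-ins for what the patch does with TYPEALIGN_DOWN and TYPEALIGN; they are not part of the patch, and the sketch assumes the OS page size is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of page_size (page_size must be 2^k). */
static uintptr_t
align_down(uintptr_t addr, size_t page_size)
{
	return addr & ~((uintptr_t) page_size - 1);
}

/* Round addr up to a multiple of page_size (page_size must be 2^k). */
static uintptr_t
align_up(uintptr_t addr, size_t page_size)
{
	return (addr + page_size - 1) & ~((uintptr_t) page_size - 1);
}

/*
 * Number of OS pages touched by the byte range [start, start + len).
 * Pages that are only partially covered at either end are counted too.
 */
static size_t
pages_spanned(uintptr_t start, size_t len, size_t page_size)
{
	uintptr_t	first;
	uintptr_t	last;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	first = align_down(start, page_size);
	last = align_up(start + len, page_size);

	return (last - first) / page_size;
}
```

With 4kB pages, a 200-byte segment starting at offset 4000 straddles a page boundary, so pages_spanned(4000, 200, 4096) gives 2, whereas rounding only the length up to whole pages (as the replaced code did) would give 1.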
v25-0001-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
From 98435a1e46768784f22aa0929a83951ac0a5a965 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 20:21:25 +0200
Subject: [PATCH v25 1/6] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 5 +-
.../expected/pg_buffercache_numa.out | 28 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 22 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 288 ++++++++++++++++++
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 75 ++++-
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 443 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..1230e244a5f
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, page_num int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..0b96476c319 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -20,6 +21,8 @@
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
+#define NUM_BUFFERCACHE_NUMA_ELEM 3
+
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
.version = PG_VERSION
@@ -58,16 +61,44 @@ typedef struct
BufferCachePagesRec *record;
} BufferCachePagesContext;
+/*
+ * Record structure holding the to be exposed cache data.
+ */
+typedef struct
+{
+ uint32 bufferid;
+ int32 numa_page;
+ int32 numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+ TupleDesc tupdesc;
+ int buffers_per_page;
+ int pages_per_buffer;
+ int os_page_size;
+ BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
/*
* Function returning data from the shared buffer cache - buffer number,
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -246,6 +277,263 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * Inquire about NUMA memory mappings for shared buffers.
+ *
+ * Returns NUMA node ID for each memory page used by the buffer. Buffers may
+ * be smaller or larger than OS memory pages. For each buffer we return one
+ * entry for each memory page used by the buffer (if the buffer is smaller,
+ * it only uses a part of one memory page).
+ *
+ * We expect both sizes (for buffers and memory pages) to be a power-of-2, so
+ * one is always a multiple of the other.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ BufferCacheNumaContext *fctx; /* User function context. */
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ j,
+ idx;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_page_status;
+ uint64 os_page_count;
+ int pages_per_buffer;
+ int buffers_per_page;
+ volatile uint64 touch pg_attribute_unused();
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ /*
+ * Different database block sizes (1kB, 2kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ buffers_per_page = os_page_size / BLCKSZ;
+ pages_per_buffer = BLCKSZ / os_page_size;
+
+ /*
+ * The OS page size and block size are expected to be 2^k, so one
+ * divides the other (we don't know in which direction).
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * Either both counts are 1 (when the pages have the same size), or
+ * exactly one of them is zero. Both can't be zero at the same time.
+ */
+ Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
+ Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
+ ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+
+ /*
+ * How many addresses we are going to query (store) depends on the
+ * relation between BLCKSZ : PAGESIZE. We need at least one status per
+ * buffer - if the memory page is larger than buffer, we still query
+ * it for each buffer. With multiple memory pages per buffer, we need
+ * that many entries.
+ */
+ os_page_count = NBuffers * Max(1, pages_per_buffer);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " "
+ "os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_count, os_page_size,
+ buffers_per_page, pages_per_buffer);
+
+
+ /* Initialize the multi-call context, load entries about buffers */
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+ /*
+ * To smoothly support upgrades from version 1.0 of this extension
+ * transparently handle the (non-)existence of the pinning_backends
+ * column. We unfortunately have to get the result type for that... -
+ * we can't use the result type determined by the function definition
+ * without potentially crashing when somebody uses the old (or even
+ * wrong) function definition though.
+ */
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "page_num",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "node_id",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate os_page_count worth of BufferCacheNumaRec records. */
+ fctx->record = (BufferCacheNumaRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCacheNumaRec) * os_page_count);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+
+ /* Used to determine the NUMA node for all OS pages at once */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * a bug in our buffers to OS page mapping code here.
+ */
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ /*
+ * If we have multiple OS pages per buffer, fill those in too. We
+ * always want at least one OS page, even if there are multiple
+ * buffers per page.
+ *
+ * Although we could query just once per each OS page, we do it
+ * repeatedly for each Buffer and hit the same address as
+ * move_pages(2) requires page alignment. This also simplifies
+ * retrieval code later on. Also note that buffer IDs start at 1.
+ */
+ for (j = 0; j < Max(1, pages_per_buffer); j++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].numa_page = j;
+
+ os_page_ptrs[idx]
+ = (char *) TYPEALIGN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+
+ ++idx;
+ }
+
+ }
+
+ /* We should get exactly the expected number of entries */
+ Assert(idx == os_page_count);
+
+ /* Query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /*
+ * Update the entries with NUMA node ID. The status array is indexed
+ * the same way as the entry index.
+ */
+ for (i = 0; i < os_page_count; i++)
+ {
+ fctx->record[i].numa_node = os_page_status[i];
+ }
+
+ /* Remember this backend touched the pages */
+ firstNumaTouch = false;
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+ Datum values[NUM_BUFFERCACHE_NUMA_ELEM];
+ bool nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(fctx->record[i].numa_page);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(fctx->record[i].numa_node);
+ nulls[2] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..b01f8e71357 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,15 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+ information is not part of <function>pg_buffercache_pages()</function>
+ itself, as it is much slower to retrieve.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +211,68 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>page_num</structfield> <type>int</type>
+ </para>
+ <para>
+ number of OS memory page for this buffer
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all the cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0c81d03950d..ed74a76a5c7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -341,6 +341,8 @@ BufFile
Buffer
BufferAccessStrategy
BufferAccessStrategyType
+BufferCacheNumaRec
+BufferCacheNumaContext
BufferCachePagesContext
BufferCachePagesRec
BufferDesc
--
2.39.5
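
As a worked example of the BLCKSZ versus OS page size mapping described in the 0001 patch, the following standalone sketch computes the two ratios and the number of addresses passed to the NUMA inquiry. The struct and function names are hypothetical and exist only for illustration; the arithmetic mirrors the patch under the assumption that both sizes are powers of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the ratios computed in the patch. */
typedef struct
{
	int			buffers_per_page;	/* > 1 when OS pages are larger */
	int			pages_per_buffer;	/* > 1 when buffers are larger */
	uint64_t	os_page_count;		/* addresses to query */
} PageMapping;

/*
 * Compute how buffers map onto OS pages. Integer division means that
 * either both ratios are 1, or exactly one of them is 0.
 */
static PageMapping
compute_page_mapping(size_t block_size, size_t os_page_size, int nbuffers)
{
	PageMapping m;
	int			entries_per_buffer;

	m.buffers_per_page = (int) (os_page_size / block_size);
	m.pages_per_buffer = (int) (block_size / os_page_size);

	/* At least one entry per buffer, even with huge pages. */
	entries_per_buffer = (m.pages_per_buffer > 1) ? m.pages_per_buffer : 1;
	m.os_page_count = (uint64_t) nbuffers * (uint64_t) entries_per_buffer;

	return m;
}
```

For 16384 buffers of 8kB on 4kB pages this yields pages_per_buffer = 2 and os_page_count = 32768, which agrees with the NOTICE output quoted at the top of the thread. With 2MB huge pages the same call yields buffers_per_page = 256 and one queried address per buffer.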
v25-0005-fixes-for-review-by-Andres.patch
From 7b279721ae04e823f20e94331dd3b0a634ff3e7f Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Sun, 6 Apr 2025 14:19:41 +0200
Subject: [PATCH v25 5/6] fixes for review by Andres
---
contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 8 --------
doc/src/sgml/system-views.sgml | 8 +++++++-
meson.build | 2 +-
src/include/port/pg_numa.h | 2 +-
src/test/regress/expected/numa.out | 1 +
src/test/regress/expected/numa_1.out | 2 ++
src/test/regress/sql/numa.sql | 1 +
8 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
index 1230e244a5f..e3b145a1687 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -10,7 +10,7 @@ AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
LANGUAGE C PARALLEL SAFE;
-- Create a view for convenient access.
-CREATE OR REPLACE VIEW pg_buffercache_numa AS
+CREATE VIEW pg_buffercache_numa AS
SELECT P.* FROM pg_buffercache_numa_pages() AS P
(bufferid integer, page_num int4, node_id int4);
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index a3c4a2578d9..df94cc6ef7d 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -375,14 +375,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/* Create a user function context for cross-call persistence */
fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
- /*
- * To smoothly support upgrades from version 1.0 of this extension
- * transparently handle the (non-)existence of the pinning_backends
- * column. We unfortunately have to get the result type for that... -
- * we can't use the result type determined by the function definition
- * without potentially crashing when somebody uses the old (or even
- * wrong) function definition though.
- */
if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index a83365ae24a..4e853885de6 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -4069,7 +4069,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
across NUMA nodes. This includes both memory allocated by
<productname>PostgreSQL</productname> itself and memory allocated
by extensions using the mechanisms detailed in
- <xref linkend="xfunc-shared-addin" />.
+ <xref linkend="xfunc-shared-addin" />. This view will output multiple rows
+ for each of the shared memory segments provided that they are spread across
+ multiple NUMA nodes. This view should not be queried by monitoring systems
+ as it is very slow and may end up allocating shared memory in case it was not
+ used earlier.
+ A current limitation of this view is that it won't show anonymous shared
+ memory allocations.
</para>
<para>
diff --git a/meson.build b/meson.build
index b562a00c588..a1516e54529 100644
--- a/meson.build
+++ b/meson.build
@@ -950,7 +950,7 @@ endif
libnumaopt = get_option('libnuma')
if not libnumaopt.disabled()
# via pkg-config
- libnuma = dependency('numa', required: libnumaopt)
+ libnuma = dependency('numa', required: false)
if not libnuma.found()
libnuma = cc.find_library('numa', required: libnumaopt)
endif
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 3c1b50c1428..7e990d9f776 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -28,7 +28,7 @@ extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
* need to page-fault before move_pages(2) syscall returns valid results.
*/
#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
- ro_volatile_var = *(uint64 *) ptr
+ ro_volatile_var = *(volatile uint64 *) ptr
#else
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
index 668172f7d79..8af5dfeb9a5 100644
--- a/src/test/regress/expected/numa.out
+++ b/src/test/regress/expected/numa.out
@@ -1,5 +1,6 @@
SELECT NOT(pg_numa_available()) AS skip_test \gset
\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
\quit
\endif
-- switch to superuser
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
index 6dd6824b4e4..c90042fa7cc 100644
--- a/src/test/regress/expected/numa_1.out
+++ b/src/test/regress/expected/numa_1.out
@@ -1,3 +1,5 @@
SELECT NOT(pg_numa_available()) AS skip_test \gset
\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+ERROR: libnuma initialization failed or NUMA is not supported on this platform
\quit
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
index 034098783fb..324481c33b7 100644
--- a/src/test/regress/sql/numa.sql
+++ b/src/test/regress/sql/numa.sql
@@ -1,5 +1,6 @@
SELECT NOT(pg_numa_available()) AS skip_test \gset
\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
\quit
\endif
--
2.39.5
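
The volatile qualifier added to pg_numa_touch_mem_if_required in 0005 is there so that the compiler cannot drop the read whose only purpose is to page-fault the memory. A minimal standalone sketch of the same idea follows; touch_page and touch_range are hypothetical helpers, not functions from the patch, and the sketch assumes the start address is suitably aligned for a 64-bit read.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Read one 64-bit word through a volatile-qualified pointer. Because the
 * access is volatile, the compiler must perform the load even if the
 * caller ignores the result, which forces the page to be faulted in.
 */
static uint64_t
touch_page(const void *ptr)
{
	return *(const volatile uint64_t *) ptr;
}

/*
 * Touch the first word of every page_size-sized step in [start, start + len).
 * Returns the number of reads performed.
 */
static size_t
touch_range(const char *start, size_t len, size_t page_size)
{
	size_t		touched = 0;

	if (page_size == 0)
		return 0;

	for (size_t off = 0; off + sizeof(uint64_t) <= len; off += page_size)
	{
		(void) touch_page(start + off);
		touched++;
	}

	return touched;
}
```

Without the volatile cast, an optimizing compiler may treat the assignment to an otherwise unused variable as dead code, in which case the subsequent move_pages(2) inquiry could still report the page as not present.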
v25-0006-fix-remaining-outstanding-issues-from-Sunday.patch
From 4ecdd9133eb33ec7d993cf8808502bd88f8e9417 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Mon, 7 Apr 2025 10:06:38 +0200
Subject: [PATCH v25 6/6] fix remaining outstanding issues from Sunday
---
contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 9 +++++----
doc/src/sgml/pgbuffercache.sgml | 4 ++--
src/backend/storage/ipc/shmem.c | 10 ++++++++--
4 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
index e3b145a1687..998289790b7 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -12,7 +12,7 @@ LANGUAGE C PARALLEL SAFE;
-- Create a view for convenient access.
CREATE VIEW pg_buffercache_numa AS
SELECT P.* FROM pg_buffercache_numa_pages() AS P
- (bufferid integer, page_num int4, node_id int4);
+ (bufferid integer, ospageid int4, nodeid int4);
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index df94cc6ef7d..fe2ffadcb3a 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -385,9 +385,9 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 2, "page_num",
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "ospageid",
INT4OID, -1, 0);
- TupleDescInitEntry(tupledesc, (AttrNumber) 3, "node_id",
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "nodeid",
INT4OID, -1, 0);
fctx->tupdesc = BlessTupleDesc(tupledesc);
@@ -430,7 +430,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
* to one big big move_pages(2) inquiry system call. Basically we ask
* for all memory pages for NBuffers.
*/
- startptr = (char *) BufferGetBlock(1);
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));
idx = 0;
for (i = 0; i < NBuffers; i++)
{
@@ -474,7 +474,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/* Only need to touch memory once per backend process lifetime */
if (firstNumaTouch)
- pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+ pg_numa_touch_mem_if_required(touch,
+ buffptr + (os_page_size * j));
++idx;
}
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index b01f8e71357..b39c9849362 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -244,7 +244,7 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>page_num</structfield> <type>int</type>
+ <structfield>ospageid</structfield> <type>int</type>
</para>
<para>
number of OS memory page for this buffer
@@ -253,7 +253,7 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>node_id</structfield> <type>int</type>
+ <structfield>nodeid</structfield> <type>int</type>
</para>
<para>
ID of <acronym>NUMA</acronym> node
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 4a9a9606f2e..69eb5bb738d 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -572,7 +572,13 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
-/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+/*
+ * SQL SRF showing NUMA memory nodes for allocated shared memory
+ *
+ * Unlike pg_get_shmem_allocations() above, this function does not
+ * output information about shared anonymous allocations and
+ * unused memory.
+ */
Datum
pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
{
@@ -694,7 +700,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
if (s < 0 || s > max_nodes)
{
elog(ERROR, "invalid NUMA node id outside of allowed range "
- "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
}
nodes[s]++;
--
2.39.5
v25-0002-adjust-page_num.patch (application/octet-stream)
From c3c030c3fb3a2c39a164a7ab1b7bea9df5f5a9b7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 5 Apr 2025 16:00:39 +0200
Subject: [PATCH v25 2/6] adjust page_num
---
contrib/pg_buffercache/pg_buffercache_pages.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 0b96476c319..a3c4a2578d9 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -315,6 +315,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
int pages_per_buffer;
int buffers_per_page;
volatile uint64 touch pg_attribute_unused();
+ char *startptr = NULL;
if (pg_numa_init() == -1)
elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
@@ -437,6 +438,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
* to one big big move_pages(2) inquiry system call. Basically we ask
* for all memory pages for NBuffers.
*/
+ startptr = (char *) BufferGetBlock(1);
idx = 0;
for (i = 0; i < NBuffers; i++)
{
@@ -469,11 +471,14 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
char *buffptr = (char *) BufferGetBlock(i + 1);
fctx->record[idx].bufferid = bufferid;
- fctx->record[idx].numa_page = j;
os_page_ptrs[idx]
- = (char *) TYPEALIGN(os_page_size,
- buffptr + (os_page_size * j));
+ = (char *) TYPEALIGN_DOWN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* calculate ID of the OS memory page */
+ fctx->record[idx].numa_page
+ = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;
/* Only need to touch memory once per backend process lifetime */
if (firstNumaTouch)
--
2.39.5
Hi,
Here's a v26 of this patch series, merging the various fixup patches.
I've also reordered the patches so that the pg_buffercache part is last.
The two other patches are ready to go, and it seems better to push the
built-in catalog before the pg_buffercache contrib module.
For 0001 and 0002, I only have some minor tweaks - comment rewordings
etc. I kept them in separate patches to make it obvious, but I think
it's fine. The one specific tweak is to account for the +1 page, which
might happen if the buffers/pages are shifted in some way.
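To illustrate the +1 page, here is a minimal standalone sketch (not the code from the patch) that counts how many OS pages a byte range touches. The helper names align_down, align_up and pages_spanned are made up for this example, and it assumes the page size is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of page_size (must be a power of two). */
static uintptr_t
align_down(uintptr_t addr, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return addr & ~((uintptr_t) page_size - 1);
}

/* Round addr up to a multiple of page_size. */
static uintptr_t
align_up(uintptr_t addr, size_t page_size)
{
	return align_down(addr + page_size - 1, page_size);
}

/* Number of OS pages touched by the byte range [start, start + size). */
static size_t
pages_spanned(uintptr_t start, size_t size, size_t page_size)
{
	uintptr_t	first = align_down(start, page_size);
	uintptr_t	end = align_up(start + size, page_size);

	return (end - first) / page_size;
}
```

With 4kB pages, an aligned 8kB range touches two pages, but the same range starting 512 bytes into a page touches three. That extra page is what has to be accounted for when sizing the arrays.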
For 0003, I made some more serious changes - I reworked how buffers and
pages are mapped. I concluded that relying on pages_per_buffer and
buffers_per_page is way too fiddly if we can't guarantee the buffers
are not "shifted" in some way, which I think we can't.
So I reworked that so that os_page_ptrs always points to actual pages
without duplicate pointers. And then for each buffer we calculate the
first page / last page, and iterate over those. A comment suggested the
old code made the retrieval simpler, but I find the new approach much
easier. And it doesn't have to worry about how the pages/buffers are
shifted and so on.
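The first page / last page calculation can be sketched roughly like this. It is a standalone illustration under my own assumptions, not the patch itself: PageRange and buffer_page_range are hypothetical names, buffers are assumed to be laid out contiguously from base, and the page size is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: inclusive range of OS page indexes for a buffer. */
typedef struct PageRange
{
	size_t		first;
	size_t		last;
} PageRange;

/*
 * base is the address of the first buffer (not necessarily page-aligned),
 * buf_index is the zero-based buffer number. Page indexes are counted from
 * the page that contains the first buffer.
 */
static PageRange
buffer_page_range(uintptr_t base, size_t buf_index,
				  size_t block_size, size_t page_size)
{
	PageRange	r;
	uintptr_t	mask;
	uintptr_t	start_page;
	uintptr_t	buf_start;
	uintptr_t	buf_end;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	assert(block_size > 0);

	mask = ~((uintptr_t) page_size - 1);
	start_page = base & mask;	/* page holding the first buffer */
	buf_start = base + buf_index * block_size;
	buf_end = buf_start + block_size - 1;	/* last byte of the buffer */

	r.first = ((buf_start & mask) - start_page) / page_size;
	r.last = ((buf_end & mask) - start_page) / page_size;
	return r;
}
```

The same function covers both directions: with 4kB pages an 8kB buffer yields a range of two pages (three if shifted), while with 2MB huge pages many buffers map to the same single page index.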
The one caveat is that we can't know how many entries the function will
produce until after going through the buffers. We can only calculate
the maximum number of entries. I think that's fine - we may allocate a
bit more space than needed, but we also save space in os_page_status.
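One way to get such an upper bound, as a rough sketch only: assume the block size and the page size are both powers of two, so an aligned buffer covers block_size / page_size pages (at least one), and add one page for a buffer that is shifted and spills into the next page. The name max_entries is made up for this example and the patch may compute the bound differently.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on (buffer, OS page) entries for nbuffers buffers. */
static size_t
max_entries(size_t nbuffers, size_t block_size, size_t page_size)
{
	size_t		pages_per_buffer;

	assert(block_size > 0 && page_size > 0);

	/* pages covered by one aligned buffer, never less than one */
	pages_per_buffer = block_size / page_size;
	if (pages_per_buffer < 1)
		pages_per_buffer = 1;

	/* +1 for a buffer that does not start on a page boundary */
	return nbuffers * (pages_per_buffer + 1);
}
```

For 16384 buffers of 8kB on 4kB pages this reserves room for 49152 entries, and with 2MB huge pages for 32768. The actual number of rows is known only after walking the buffers.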
I intend to push 0001 and 0002 shortly, and 0003 after a bit more review
and testing, unless I hear objections.
regards
--
Tomas Vondra
Attachments:
v26-0001-Add-support-for-basic-NUMA-awareness.patch (text/x-patch; charset=UTF-8)
From fcc4fc2ada33cbbc962d561ddeea6966f0d55492 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v26 1/7] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 21 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 442 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 8f4a5ab28ec..0936010718d 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index fc5f7475d07..2a78cddd825 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0224f93733d..9ab070adffb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..8ebf0b03ec0 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2655,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma support for basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 27717ad8976..a1516e54529 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: false)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3279,6 +3300,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3935,6 +3957,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5d5be8ba4e1..dfc59ea0cc8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8542,6 +8542,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9891b9b05c3..1af0b6316dd 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -689,6 +689,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..7e990d9f776
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(volatile uint64 *) ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 48d2dfb7cf3..fc7b059fee5 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use move_pages(2) syscall here - instead of get_mempolicy(2) - as the
+ * first one allows us to batch and query about many memory pages in one single
+ * giant system call that is way faster.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.49.0
v26-0002-review.patch (text/x-patch; charset=UTF-8)
From 522d6f7045b0194eca44177c13d18e7f4865a79f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 7 Apr 2025 15:22:43 +0200
Subject: [PATCH v26 2/7] review
---
doc/src/sgml/installation.sgml | 7 ++++---
src/include/catalog/pg_proc.dat | 2 +-
2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 8ebf0b03ec0..077bcc20759 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1161,7 +1161,8 @@ build-postgresql:
<listitem>
<para>
Build with libnuma support for basic NUMA support.
- Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
+ Only supported on platforms for which the <productname>libnuma</productname>
+ library is implemented.
</para>
</listitem>
</varlistentry>
@@ -2660,8 +2661,8 @@ ninja install
<listitem>
<para>
Build with libnuma support for basic NUMA support.
- Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
- The default for this option is auto.
+ Only supported on platforms for which the <productname>libnuma</productname>
+ library is implemented. The default for this option is auto.
</para>
</listitem>
</varlistentry>
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index dfc59ea0cc8..04834d130f9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8542,7 +8542,7 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
-{ oid => '9685', descr => 'Is NUMA compilation available?',
+{ oid => '9685', descr => 'Is NUMA support available?',
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
--
2.49.0
v26-0003-Introduce-pg_shmem_allocations_numa-view.patch (text/x-patch; charset=UTF-8)
From c434c9f7062da8a78eeedd0a7576b8b58a8e3442 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v26 3/7] Introduce pg_shmem_allocations_numa view
Introduce a new pg_shmem_allocations_numa view with information about how
shared memory is distributed across NUMA nodes.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 85 ++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 167 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 13 ++
src/test/regress/expected/numa_1.out | 5 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 10 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 319 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..4e853885de6 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,86 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />. This view will output multiple rows
+ for each of the shared memory segments provided that they are spread across
+ multiple NUMA nodes. This view should not be queried by monitoring systems
+ as it is very slow and may end up allocating shared memory in case it was not
+ used earlier.
+ A current limitation of this view is that it won't show anonymous shared memory
+ allocations.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..69eb5bb738d 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,167 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/*
+ * SQL SRF showing NUMA memory nodes for allocated shared memory
+ *
+ * Contrary to above one - pg_get_shmem_allocations() - in this function
+ * we don't output information aobut shared anonymous allocations and
+ * unused memory.
+ */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * XXX Isn't this wasteful? But there probably is one large segment of
+ * shared memory, much larger than the rest anyway.
+ */
+ shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+ char *startptr,
+ *endptr;
+ Size total_len;
+
+ /*
+ * Calculate the range of OS pages used by this segment. The segment
+ * may start / end half-way through a page, we want to count these
+ * pages too. So we align the start/end pointers down/up, and then
+ * calculate the number of pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, ent->location);
+ endptr = (char *) TYPEALIGN(os_page_size,
+ (char *) ent->location + ent->allocated_size);
+ total_len = (endptr - startptr);
+
+ shm_ent_page_count = total_len / os_page_size;
+
+ /*
+ * If we get ever 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ /*
+ * Setup page_ptrs[] with pointers to all OS pages for this segment,
+ * and get the NUMA status using pg_numa_query_pages.
+ *
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (ENOENT, which indicates unmapped/unallocated pages).
+ */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = startptr + (i * os_page_size);
+
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ /* Count the OS pages on each NUMA node for this shared memory entry */
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s < 0 || s > max_nodes)
+ {
+ elog(ERROR, "invalid NUMA node id outside of allowed range "
+ "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ nodes[s]++;
+ }
+
+ /*
+ * Add one entry for each NUMA node, including those without allocated
+ * memory for this segment.
+ */
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory.
+ *
+ * XXX Not quite sure why this is at the end, and what "output memory"
+ * refers to.
+ */
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 04834d130f9..653258fd100 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8546,6 +8546,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..8af5dfeb9a5
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,13 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..c90042fa7cc
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,5 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+ERROR: libnuma initialization failed or NUMA is not supported on this platform
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 1fddb13b6ae..c25062c288f 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3219,8 +3219,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3242,6 +3242,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3261,6 +3267,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..abfdc97abc5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..324481c33b7
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,10 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 85d7280f35f..f337aa67c13 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1947,8 +1947,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1958,12 +1958,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.49.0
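
A note on the page-range calculation used in pg_get_shmem_allocations_numa() above: the following is a minimal standalone sketch, not code from the patch. It assumes a power-of-2 OS page size and uses plain integer addresses; the helper names (align_down, align_up, pages_spanned) are hypothetical stand-ins for what TYPEALIGN_DOWN and TYPEALIGN do in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round addr down to a multiple of align (power of 2). */
static uintptr_t
align_down(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Hypothetical helper: round addr up to a multiple of align (power of 2). */
static uintptr_t
align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/*
 * Number of OS pages touched by the byte range [location, location + size).
 * Pages that are only partially covered at either end are counted too,
 * which mirrors the intent of the startptr/endptr computation in the patch.
 */
static size_t
pages_spanned(uintptr_t location, size_t size, size_t os_page_size)
{
    uintptr_t start = align_down(location, os_page_size);
    uintptr_t end = align_up(location + size, os_page_size);

    return (size_t) (end - start) / os_page_size;
}
```

With 4kB pages, an 8kB allocation that starts on a page boundary spans two pages, while the same allocation starting 96 bytes before a boundary spans three. That is why an entry can report more bytes per node than its allocated_size.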
Attachment: v26-0004-review.patch (text/x-patch; charset=UTF-8)
From 249513cf517da3b7c07234d0f1545e517ebdd084 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 7 Apr 2025 15:42:43 +0200
Subject: [PATCH v26 4/7] review
---
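
For readers following the 0xff comment touched by this review: a minimal standalone sketch of the sentinel and the per-node accounting, written under assumptions. It does not call the kernel; the status array is filled in by hand, and the function names (reset_status, count_bytes_per_node) are hypothetical, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Pre-fill the status array with 0xff bytes. On two's-complement machines
 * every int then reads as -1, a value no valid NUMA node ID can take, so a
 * slot the inquiry never filled in is easy to spot afterwards.
 */
static void
reset_status(int *status, size_t npages)
{
    memset(status, 0xff, sizeof(int) * npages);
}

/*
 * Sum up bytes per NUMA node from a move_pages()-style status array.
 * Returns 0 on success, or -1 if any status is outside [0, max_node],
 * for example the -1 sentinel or a negative errno such as -2 (ENOENT).
 */
static int
count_bytes_per_node(const int *status, size_t npages, size_t os_page_size,
                     int max_node, size_t *bytes_per_node)
{
    assert(max_node >= 0);
    memset(bytes_per_node, 0, sizeof(size_t) * ((size_t) max_node + 1));

    for (size_t i = 0; i < npages; i++)
    {
        if (status[i] < 0 || status[i] > max_node)
            return -1;
        bytes_per_node[status[i]] += os_page_size;
    }
    return 0;
}
```

The patch raises an ERROR where this sketch returns -1. The range check has to come before the array index, otherwise a negative status would write outside the per-node array.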
src/backend/storage/ipc/shmem.c | 26 +++++++++-----------------
1 file changed, 9 insertions(+), 17 deletions(-)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 69eb5bb738d..e10b380e5c7 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -575,9 +575,8 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
/*
* SQL SRF showing NUMA memory nodes for allocated shared memory
*
- * Contrary to above one - pg_get_shmem_allocations() - in this function
- * we don't output information aobut shared anonymous allocations and
- * unused memory.
+ * Compared to pg_get_shmem_allocations(), this function does not return
+ * information about shared anonymous allocations and unused shared memory.
*/
Datum
pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
@@ -624,10 +623,12 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
* pages in shared memory rather than calculating the exact requirements
* for each segment.
*
- * XXX Isn't this wasteful? But there probably is one large segment of
- * shared memory, much larger than the rest anyway.
+ * Add 1, because we don't know how exactly the segments align to OS
+ * pages, so the allocation might use one more memory page. In practice
+ * this is not very likely, and moreover we have more entries, each of
+ * them using only fraction of the total pages.
*/
- shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+ shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
pages_status = palloc(sizeof(int) * shm_total_page_count);
@@ -661,8 +662,8 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
shm_ent_page_count = total_len / os_page_size;
/*
- * If we get ever 0xff back from kernel inquiry, then we probably have
- * bug in our buffers to OS page mapping code here.
+ * If we ever get 0xff (-1) back from kernel inquiry, then we probably
+ * have a bug in mapping buffers to OS pages.
*/
memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
@@ -721,15 +722,6 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
}
}
- /*
- * We are ignoring the following memory regions (as compared to
- * pg_get_shmem_allocations()): 1. output shared memory allocated but not
- * counted via the shmem index 2. output as-of-yet unused shared memory.
- *
- * XXX Not quite sure why this is at the end, and what "output memory"
- * refers to.
- */
-
LWLockRelease(ShmemIndexLock);
firstNumaTouch = false;
--
2.49.0
Attachment: v26-0005-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch (text/x-patch; charset=UTF-8)
From 7d6745e371dfc82c57e0168d04551e83bac7e3b4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 20:21:25 +0200
Subject: [PATCH v26 5/7] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
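
Note on the block size vs. OS page size mapping described in the commit message: the sketch below is a standalone illustration, not code from the patch. It assumes both sizes are powers of 2 and that the buffer pool starts on an OS page boundary (the patch instead aligns the address of the first buffer down). The names PageMapping, compute_mapping and os_page_id are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct PageMapping
{
    size_t buffers_per_page;   /* 0 when a buffer is larger than a page */
    size_t pages_per_buffer;   /* 0 when a page is larger than a buffer */
    size_t entries_per_buffer; /* addresses queried per buffer, at least 1 */
} PageMapping;

/* Derive the mapping between database blocks and OS memory pages. */
static PageMapping
compute_mapping(size_t block_size, size_t os_page_size)
{
    PageMapping m;

    assert(block_size > 0 && os_page_size > 0);
    /* With power-of-2 sizes, one size always divides the other. */
    assert(os_page_size % block_size == 0 || block_size % os_page_size == 0);

    m.buffers_per_page = os_page_size / block_size;
    m.pages_per_buffer = block_size / os_page_size;
    m.entries_per_buffer = m.pages_per_buffer > 1 ? m.pages_per_buffer : 1;
    return m;
}

/*
 * ID of the OS page holding the j-th queried address of a buffer, where
 * buffer_index is 0-based and the pool is assumed to start at page 0.
 * Integer division rounds the byte offset down to its page.
 */
static size_t
os_page_id(size_t buffer_index, size_t j, size_t block_size,
           size_t os_page_size)
{
    size_t offset = buffer_index * block_size + j * os_page_size;

    return offset / os_page_size;
}
```

With 8kB blocks and 4kB pages each buffer yields two entries. With 2MB huge pages each buffer yields one entry, and 256 consecutive buffers report the same OS page ID.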
contrib/pg_buffercache/Makefile | 5 +-
.../expected/pg_buffercache_numa.out | 28 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 22 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 286 ++++++++++++++++++
.../sql/pg_buffercache_numa.sql | 20 ++
doc/src/sgml/pgbuffercache.sgml | 75 ++++-
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 441 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..998289790b7
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, ospageid int4, nodeid int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..fe2ffadcb3a 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -20,6 +21,8 @@
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
+#define NUM_BUFFERCACHE_NUMA_ELEM 3
+
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
.version = PG_VERSION
@@ -58,16 +61,44 @@ typedef struct
BufferCachePagesRec *record;
} BufferCachePagesContext;
+/*
+ * Record structure holding the to be exposed cache data.
+ */
+typedef struct
+{
+ uint32 bufferid;
+ int32 numa_page;
+ int32 numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+ TupleDesc tupdesc;
+ int buffers_per_page;
+ int pages_per_buffer;
+ int os_page_size;
+ BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
/*
* Function returning data from the shared buffer cache - buffer number,
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -246,6 +277,261 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * Inquire about NUMA memory mappings for shared buffers.
+ *
+ * Returns NUMA node ID for each memory page used by the buffer. Buffers may
+ * be smaller or larger than OS memory pages. For each buffer we return one
+ * entry for each memory page used by the buffer (if the buffer is smaller,
+ * it only uses a part of one memory page).
+ *
+ * We expect both sizes (for buffers and memory pages) to be a power-of-2, so
+ * one is always a multiple of the other.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ BufferCacheNumaContext *fctx; /* User function context. */
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ j,
+ idx;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_page_status;
+ uint64 os_page_count;
+ int pages_per_buffer;
+ int buffers_per_page;
+ volatile uint64 touch pg_attribute_unused();
+ char *startptr = NULL;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS
+ * memory page size 2. Calculate how many OS pages are used by all
+ * buffer blocks 3. Calculate how many OS pages are contained within
+ * each database block.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+ buffers_per_page = os_page_size / BLCKSZ;
+ pages_per_buffer = BLCKSZ / os_page_size;
+
+ /*
+ * The page and block sizes are expected to be 2^k, so one divides the
+ * other (we don't know in which direction).
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * Either both counts are 1 (when the pages have the same size), or
+ * exactly one of them is zero. Both can't be zero at the same time.
+ */
+ Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
+ Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
+ ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+
+ /*
+ * How many addresses we are going to query (store) depends on the
+ * relation between BLCKSZ : PAGESIZE. We need at least one status per
+ * buffer - if the memory page is larger than buffer, we still query
+ * it for each buffer. With multiple memory pages per buffer, we need
+ * that many entries.
+ */
+ os_page_count = NBuffers * Max(1, pages_per_buffer);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " "
+ "os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+ NBuffers, os_page_count, os_page_size,
+ buffers_per_page, pages_per_buffer);
+
+
+ /* Initialize the multi-call context, load entries about buffers */
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "ospageid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "nodeid",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /* Allocate os_page_count worth of BufferCacheNumaRec records. */
+ fctx->record = (BufferCacheNumaRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCacheNumaRec) * os_page_count);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = NBuffers;
+ funcctx->user_fctx = fctx;
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+
+ /* Used to determine the NUMA node for all OS pages at once */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(int) * os_page_count);
+
+ /*
+ * If we ever get 0xff (-1) back from kernel inquiry, then we probably
+ * have a bug in our buffers to OS page mapping code here.
+ */
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ /*
+ * If we have multiple OS pages per buffer, fill those in too. We
+ * always want at least one OS page, even if there are multiple
+ * buffers per page.
+ *
+ * Although we could query just once per each OS page, we do it
+ * repeatedly for each Buffer and hit the same address, as
+ * move_pages(2) requires page alignment. This also simplifies
+ * retrieval code later on. Also, buffer IDs start from 1.
+ */
+ for (j = 0; j < Max(1, pages_per_buffer); j++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+
+ fctx->record[idx].bufferid = bufferid;
+
+ os_page_ptrs[idx]
+ = (char *) TYPEALIGN_DOWN(os_page_size,
+ buffptr + (os_page_size * j));
+
+ /* calculate ID of the OS memory page */
+ fctx->record[idx].numa_page
+ = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch,
+ buffptr + (os_page_size * j));
+
+ ++idx;
+ }
+
+ }
+
+ /* We should get exactly the expected number of entries */
+ Assert(idx == os_page_count);
+
+ /* Query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /*
+ * Update the entries with NUMA node ID. The status array is indexed
+ * the same way as the entry index.
+ */
+ for (i = 0; i < os_page_count; i++)
+ {
+ fctx->record[i].numa_node = os_page_status[i];
+ }
+
+ /* Remember this backend touched the pages */
+ firstNumaTouch = false;
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+ Datum values[NUM_BUFFERCACHE_NUMA_ELEM];
+ bool nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(fctx->record[i].numa_page);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(fctx->record[i].numa_node);
+ nulls[2] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..b39c9849362 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,15 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+ information is not part of <function>pg_buffercache_pages()</function>
+ itself, as it is much slower to retrieve.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +211,68 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>ospageid</structfield> <type>int</type>
+ </para>
+ <para>
+ number of the OS memory page for this buffer
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>nodeid</structfield> <type>int</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d42b943ef94..f7ba0ec809e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -341,6 +341,8 @@ BufFile
Buffer
BufferAccessStrategy
BufferAccessStrategyType
+BufferCacheNumaRec
+BufferCacheNumaContext
BufferCachePagesContext
BufferCachePagesRec
BufferDesc
--
2.49.0
v26-0006-reworks.patch (text/x-patch; charset=UTF-8)
From e7d2c281c12425012882263857f76d0d395f3abc Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 7 Apr 2025 16:43:32 +0200
Subject: [PATCH v26 6/7] reworks
---
contrib/pg_buffercache/pg_buffercache_pages.c | 169 +++++++++---------
1 file changed, 83 insertions(+), 86 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index fe2ffadcb3a..03fc6574a52 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -306,64 +306,85 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
if (SRF_IS_FIRSTCALL())
{
int i,
- j,
idx;
Size os_page_size = 0;
void **os_page_ptrs = NULL;
int *os_page_status;
uint64 os_page_count;
int pages_per_buffer;
- int buffers_per_page;
+ int max_entries;
volatile uint64 touch pg_attribute_unused();
- char *startptr = NULL;
+ char *startptr,
+ *endptr;
if (pg_numa_init() == -1)
elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
/*
- * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
- * while the OS may have different memory page sizes.
+ * The database block size and OS memory page size are unlikely to be
+ * the same. The block size is 1-32KB, the memory page size depends on
+ * platform. On x86 it's usually 4KB, on ARM it's 4KB or 64KB, but
+ * there are also features like THP etc. Moreover, we don't quite know
+ * how the pages and buffers "align" in memory - the buffers may be
+ * shifted in some way, using more memory pages than necessary.
*
- * To correctly map between them, we need to: 1. Determine the OS
- * memory page size 2. Calculate how many OS pages are used by all
- * buffer blocks 3. Calculate how many OS pages are contained within
- * each database block.
+ * So we need to be careful about mapping buffers to memory pages. We
+ * calculate the maximum number of pages a buffer might use, so that
+ * we allocate enough space for the entries. And then we count the
+ * actual number of entries as we scan the buffers.
*
* This information is needed before calling move_pages() for NUMA
* node id inquiry.
*/
os_page_size = pg_numa_get_pagesize();
- buffers_per_page = os_page_size / BLCKSZ;
- pages_per_buffer = BLCKSZ / os_page_size;
/*
* The pages and block size is expected to be 2^k, so one divides the
- * other (we don't know in which direction).
+ * other (we don't know in which direction). This does not say
+ * anything about relative alignment of pages/buffers.
*/
Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
/*
- * Either both counts are 1 (when the pages have the same size), or
- * exacly one of them is zero. Both can't be zero at the same time.
+ * How many addresses we are going to query? Simply get the page for
+ * the first buffer, and first page after the last buffer, and count
+ * the pages from that.
*/
- Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
- Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
- ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size,
+ BufferGetBlock(1));
+ endptr = (char *) TYPEALIGN(os_page_size,
+ (char *) BufferGetBlock(NBuffers) + BLCKSZ);
+ os_page_count = (endptr - startptr) / os_page_size;
+
+ /* Used to determine the NUMA node for all OS pages at once */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(uint64) * os_page_count);
+
+ /* Fill pointers for all the memory pages. */
+ idx = 0;
+ for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
+ {
+ os_page_ptrs[idx++] = ptr;
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, ptr);
+ }
+
+ Assert(idx == os_page_count);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_count=" UINT64_FORMAT " "
+ "os_page_size=%zu", NBuffers, os_page_count, os_page_size);
/*
- * How many addresses we are going to query (store) depends on the
- * relation between BLCKSZ : PAGESIZE. We need at least one status per
- * buffer - if the memory page is larger than buffer, we still query
- * it for each buffer. With multiple memory pages per buffer, we need
- * that many entries.
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
*/
- os_page_count = NBuffers * Max(1, pages_per_buffer);
-
- elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " "
- "os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
- NBuffers, os_page_count, os_page_size,
- buffers_per_page, pages_per_buffer);
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+ /* Query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
/* Initialize the multi-call context, load entries about buffers */
@@ -392,29 +413,24 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
fctx->tupdesc = BlessTupleDesc(tupledesc);
- /* Allocate NBuffers worth of BufferCachePagesRec records. */
+ /*
+ * Each buffer needs at least one entry, but it might be offset in
+ * some way, and use one extra entry. So we allocate space for the
+ * maximum number of entries we might need, and then count the exact
+ * number as we're walking buffers. That way we can do it in one pass,
+ * without reallocating memory.
+ */
+ pages_per_buffer = Max(1, BLCKSZ / os_page_size) + 1;
+ max_entries = NBuffers * pages_per_buffer;
+
+ /* Allocate entries for BufferCacheNumaRec records. */
fctx->record = (BufferCacheNumaRec *)
MemoryContextAllocHuge(CurrentMemoryContext,
- sizeof(BufferCacheNumaRec) * os_page_count);
-
- /* Set max calls and remember the user function context. */
- funcctx->max_calls = NBuffers;
- funcctx->user_fctx = fctx;
+ sizeof(BufferCacheNumaRec) * max_entries);
/* Return to original context when allocating transient memory */
MemoryContextSwitchTo(oldcontext);
-
- /* Used to determine the NUMA node for all OS pages at once */
- os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
- os_page_status = palloc(sizeof(uint64) * os_page_count);
-
- /*
- * If we ever get 0xff back from kernel inquiry, then we probably have
- * bug in our buffers to OS page mapping code here.
- */
- memset(os_page_status, 0xff, sizeof(int) * os_page_count);
-
if (firstNumaTouch)
elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
@@ -434,9 +450,13 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
idx = 0;
for (i = 0; i < NBuffers; i++)
{
+ char *buffptr = (char *) BufferGetBlock(i + 1);
BufferDesc *bufHdr;
uint32 buf_state;
uint32 bufferid;
+ int32 ospageid;
+ char *startptr_buff,
+ *endptr_buff;
CHECK_FOR_INTERRUPTS();
@@ -445,58 +465,35 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/* Lock each buffer header before inspecting. */
buf_state = LockBufHdr(bufHdr);
bufferid = BufferDescriptorGetBuffer(bufHdr);
-
UnlockBufHdr(bufHdr, buf_state);
- /*
- * If we have multiple OS pages per buffer, fill those in too. We
- * always want at least one OS page, even if there are multiple
- * buffers per page.
- *
- * Although we could query just once per each OS page, we do it
- * repeatedly for each Buffer and hit the same address as
- * move_pages(2) requires page alignment. This also simplifies
- * retrieval code later on. Also NBuffers starts from 1.
- */
- for (j = 0; j < Max(1, pages_per_buffer); j++)
- {
- char *buffptr = (char *) BufferGetBlock(i + 1);
-
- fctx->record[idx].bufferid = bufferid;
+ /* start of the first page of this buffer */
+ startptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr);
- os_page_ptrs[idx]
- = (char *) TYPEALIGN_DOWN(os_page_size,
- buffptr + (os_page_size * j));
+ /* start of the page right after this buffer */
+ endptr_buff = (char *) TYPEALIGN(os_page_size, buffptr + BLCKSZ);
- /* calculate ID of the OS memory page */
- fctx->record[idx].numa_page
- = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;
+ /* calculate ID of the first page for this buffer */
+ ospageid = (startptr_buff - startptr) / os_page_size;
- /* Only need to touch memory once per backend process lifetime */
- if (firstNumaTouch)
- pg_numa_touch_mem_if_required(touch,
- buffptr + (os_page_size * j));
+ /* Add an entry for each OS page overlapping with this buffer. */
+ for (char *ptr = startptr_buff; ptr < endptr_buff; ptr += os_page_size)
+ {
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].numa_page = ospageid;
+ fctx->record[idx].numa_node = os_page_status[ospageid];
+ /* advance to the next entry/page */
++idx;
+ ++ospageid;
}
-
}
- /* We should get exactly the expected number of entries */
- Assert(idx == os_page_count);
-
- /* Query NUMA status for all the pointers */
- if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
- elog(ERROR, "failed NUMA pages inquiry: %m");
+ Assert((idx >= os_page_count) && (idx <= max_entries));
- /*
- * Update the entries with NUMA node ID. The status array is indexed
- * the same way as the entry index.
- */
- for (i = 0; i < os_page_count; i++)
- {
- fctx->record[i].numa_node = os_page_status[i];
- }
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = idx;
+ funcctx->user_fctx = fctx;
/* Remember this backend touched the pages */
firstNumaTouch = false;
--
2.49.0
v26-0007-fixup.patch (text/x-patch; charset=UTF-8)
From 3478b7dfb45bb02469030b330f7f79b987dc0025 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 7 Apr 2025 17:10:06 +0200
Subject: [PATCH v26 7/7] fixup
---
contrib/pg_buffercache/expected/pg_buffercache_numa.out | 7 ++++---
contrib/pg_buffercache/sql/pg_buffercache_numa.sql | 7 ++++---
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
index d4de5ea52fc..a10b331a552 100644
--- a/contrib/pg_buffercache/expected/pg_buffercache_numa.out
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -2,9 +2,10 @@ SELECT NOT(pg_numa_available()) AS skip_test \gset
\if :skip_test
\quit
\endif
-select count(*) = (select setting::bigint
- from pg_settings
- where name = 'shared_buffers')
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
from pg_buffercache_numa;
?column?
----------
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
index 2225b879f58..837f3d64e21 100644
--- a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -3,9 +3,10 @@ SELECT NOT(pg_numa_available()) AS skip_test \gset
\quit
\endif
-select count(*) = (select setting::bigint
- from pg_settings
- where name = 'shared_buffers')
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
from pg_buffercache_numa;
-- Check that the functions / views can't be accessed by default. To avoid
--
2.49.0
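The switch from "=" to ">=" in the regression test follows from how buffers can straddle OS pages. Below is a minimal sketch of that counting, not code from the patches: pages_for_buffer() and align_down() are hypothetical names, addresses are plain integers, and the page size is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of page_size (a power of two). */
static uintptr_t
align_down(uintptr_t addr, size_t page_size)
{
	return addr & ~((uintptr_t) page_size - 1);
}

/*
 * Hypothetical helper: number of OS pages overlapping the buffer that
 * occupies [buf_addr, buf_addr + block_size).
 */
static size_t
pages_for_buffer(uintptr_t buf_addr, size_t block_size, size_t page_size)
{
	uintptr_t	first;
	uintptr_t	last;

	assert(block_size > 0);
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	/* page holding the first byte, and page holding the last byte */
	first = align_down(buf_addr, page_size);
	last = align_down(buf_addr + block_size - 1, page_size);

	return (size_t) ((last - first) / page_size) + 1;
}
```

An aligned 8kB buffer on 4kB pages yields two entries, the same buffer shifted by 2kB yields three, and a buffer inside a 2MB huge page yields one. That is why the view returns at least one row per buffer, and why Max(1, BLCKSZ / os_page_size) + 1 works as an upper bound per buffer.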
Hi,
On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote:
On 4/6/25 01:00, Andres Freund wrote:
On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
I think one thing that the docs should mention is that calling the numa
functions/views will force the pages to be allocated, even if they're
currently unused.
Newly started server, with s_b of 32GB and 2MB huge pages:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 34448
HugePages_Rsvd: 16437
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
run
SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
Now the pages that previously were marked as reserved are actually allocated:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 18012
HugePages_Rsvd: 1
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
I don't see how we can avoid that right now, but at the very least we ought to
document it.
The only allocation where that really matters is shared_buffers. I wonder if
we could special case the logic for that, by only probing if at least one of
the buffers in the range is valid.
Then we could treat a page status of -ENOENT as "page is not mapped" and
display NULL for the node_id?
Of course that would mean that we'd always need to
pg_numa_touch_mem_if_required(), not just the first time round, because we
previously might not have for a page that is now valid. But compared to the
cost of actually allocating pages, the cost for that seems small.
I don't think this would be a good trade off. The buffers already have a
NUMA node, and users would be interested in that.
The thing is that the buffer might *NOT* have a numa node. That's e.g. the
case in the above example - otherwise we wouldn't initially have seen the
large HugePages_Rsvd.
Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
itself wouldn't be too bad - in fact I'd rather like to have an explicit way
of doing that. The problem is that that leads to all those allocations to
happen on the *current* numa node (unless you have started postgres with
numactl --interleave=all or such), rather than the node where the normal first
use would have allocated it.
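For readers unfamiliar with what "touching" amounts to, here is a minimal sketch. It is not the pg_numa_touch_mem_if_required() macro from the patch: touch_pages() is a hypothetical helper, and it works on any readable memory range instead of the shared memory segment. Under the kernel's default local allocation policy, a page that is faulted in for the first time this way ends up on the node of the CPU doing the read, which is the concern raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the macro from the patch): read one byte from
 * every OS page overlapping [base, base + len), so that the kernel has to
 * fault each page in. Returns the number of pages read.
 */
static size_t
touch_pages(const char *base, size_t len, size_t page_size)
{
	uintptr_t	first = (uintptr_t) base;
	uintptr_t	end = first + len;
	uintptr_t	page;
	size_t		touched = 0;

	/* page sizes are powers of two, which makes the masking below valid */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	if (len == 0)
		return 0;

	for (page = first & ~((uintptr_t) page_size - 1); page < end; page += page_size)
	{
		/* stay inside the caller's range on a partially covered first page */
		uintptr_t	addr = (page < first) ? first : page;

		/* the volatile access keeps the compiler from dropping the read */
		(void) *(const volatile unsigned char *) addr;
		touched++;
	}

	return touched;
}
```

A read is enough to fault the page in; nothing is written, so the contents of the range are left alone.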
It's just that we don't have the memory mapped in the current backend, so
I'd bet people would not be happy with NULL, and would proceed to force the
allocation in some other way (say, a large query of some sort). Which
obviously causes a lot of other problems.
I don't think that really would be the case with what I proposed? If any
buffer in the region were valid, we would force the allocation to become known
to the current backend.
Greetings,
Andres Freund
Hi,
On 2025-04-06 13:51:34 +0200, Tomas Vondra wrote:
On 4/6/25 00:29, Andres Freund wrote:
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
Over the patchseries the related code is duplicated. Seems like it'd be good
to put it into pg_numa instead? This seems like the thing that's good to
abstract away in one central spot.
Abstract away which part, exactly? I thought about moving some of the
code to port/pg_numa.c, but it didn't seem worth it.
The easiest would be to just have a single function that does this for the
whole shared memory allocation, without having to integrate it with the
per-allocation or per-buffer loop.
+ /*
+ * If we have multiple OS pages per buffer, fill those in too. We
+ * always want at least one OS page, even if there are multiple
+ * buffers per page.
+ *
+ * Although we could query just once per each OS page, we do it
+ * repeatedly for each Buffer and hit the same address as
+ * move_pages(2) requires page alignment. This also simplifies
+ * retrieval code later on. Also NBuffers starts from 1.
+ */
+ for (j = 0; j < Max(1, pages_per_buffer); j++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].numa_page = j;
+
+ os_page_ptrs[idx]
+ = (char *) TYPEALIGN(os_page_size,
+ buffptr + (os_page_size * j));
FWIW, this bit here is the most expensive part of the function itself, as the
compiler has no choice but to implement it as an actual division, as
os_page_size is a runtime variable.
It'd be fine to leave it like that, the call to numa_move_pages() is way more
expensive. But it shouldn't be too hard to do that alignment once, rather than
having to do it over and over.
Division? It's entirely possible I'm missing something obvious, but I
don't see any divisions in this code.
Oops. The division was only added in a subsequent patch, not the quoted
code. At the time it was:
+ /* calculate ID of the OS memory page */
+ fctx->record[idx].numa_page
+ = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;
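One possible way to avoid the per-entry division, given as a sketch only and not something the patch does: since the page size is expected to be a power of two, its log2 can be computed once up front and the page ID derived with a shift. page_shift() and os_page_index() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute log2(page_size) once; page_size must be a power of two. */
static unsigned
page_shift(size_t page_size)
{
	unsigned	shift = 0;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	while (((size_t) 1 << shift) < page_size)
		shift++;

	return shift;
}

/*
 * Index of the OS page containing ptr, counted from the page-aligned
 * startptr. Equivalent to (ptr - startptr) / page_size, without a
 * runtime division.
 */
static size_t
os_page_index(const char *startptr, const char *ptr, unsigned shift)
{
	assert(ptr >= startptr);

	return (size_t) (ptr - startptr) >> shift;
}
```

Whether this is worth doing is a separate question: as noted above, the inquiry call itself costs far more than the division.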
Greetings,
Andres Freund
On 4/7/25 17:51, Andres Freund wrote:
Hi,
On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote:
On 4/6/25 01:00, Andres Freund wrote:
On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
I think one thing that the docs should mention is that calling the numa
functions/views will force the pages to be allocated, even if they're
currently unused.
Newly started server, with s_b of 32GB and 2MB huge pages:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 34448
HugePages_Rsvd: 16437
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
run
SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
Now the pages that previously were marked as reserved are actually allocated:
grep ^Huge /proc/meminfo
HugePages_Total: 34802
HugePages_Free: 18012
HugePages_Rsvd: 1
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 76517376 kB
I don't see how we can avoid that right now, but at the very least we ought to
document it.
The only allocation where that really matters is shared_buffers. I wonder if
we could special case the logic for that, by only probing if at least one of
the buffers in the range is valid.
Then we could treat a page status of -ENOENT as "page is not mapped" and
display NULL for the node_id?
Of course that would mean that we'd always need to
pg_numa_touch_mem_if_required(), not just the first time round, because we
previously might not have for a page that is now valid. But compared to the
cost of actually allocating pages, the cost for that seems small.
I don't think this would be a good trade off. The buffers already have a
NUMA node, and users would be interested in that.
The thing is that the buffer might *NOT* have a numa node. That's e.g. the
case in the above example - otherwise we wouldn't initially have seen the
large HugePages_Rsvd.
Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
itself wouldn't be too bad - in fact I'd rather like to have an explicit way
of doing that. The problem is that that leads to all those allocations to
happen on the *current* numa node (unless you have started postgres with
numactl --interleave=all or such), rather than the node where the normal first
use would have allocated it.
I agree, forcing those allocations to happen on a single node seems
rather unfortunate. But really, how likely is it that someone will run
this function on a cluster that hasn't already allocated this memory?
I'm not saying it can't happen, but we already have this issue if you
start and do a warmup from a single connection ...
It's just that we don't have the memory mapped in the current backend, so
I'd bet people would not be happy with NULL, and would proceed to force the
allocation in some other way (say, a large query of some sort). Which
obviously causes a lot of other problems.
I don't think that really would be the case with what I proposed? If any
buffer in the region were valid, we would force the allocation to become known
to the current backend.
It's not quite clear to me what exactly you are proposing :-(
I believe you're referring to this:
The only allocation where that really matters is shared_buffers. I wonder if
we could special case the logic for that, by only probing if at least one of
the buffers in the range is valid.
Then we could treat a page status of -ENOENT as "page is not mapped" and
display NULL for the node_id?
Of course that would mean that we'd always need to
pg_numa_touch_mem_if_required(), not just the first time round, because we
previously might not have for a page that is now valid. But compared to the
cost of actually allocating pages, the cost for that seems small.
I suppose by "range" you mean buffers on a given memory page, and
"valid" means BufferIsValid. Yeah, that probably means the memory page
is allocated. But if the buffer is invalid, it does not mean the memory
is not allocated, right? So does it make the buffer not interesting?
I'd find this ambiguity rather confusing, i.e. we'd never know if NULL
means just "invalid buffer" or "not allocated". Maybe we should simply
return rows only for valid buffers, to make it more explicit that we say
nothing about NUMA nodes for the invalid ones.
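One way to read the "range" idea, as a sketch only: for a given OS page, look at the buffers overlapping it and probe the page only if at least one of them is valid. The function below is hypothetical and works on byte offsets rather than real buffer descriptors. The valid[] array stands in for whatever per-buffer check ends up being used, which the thread leaves open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: does OS page page_idx overlap at least one valid
 * buffer? Buffers are block_size bytes each, laid out back to back,
 * with buffer 0 starting first_buf_offset bytes into page 0.
 */
static bool
page_has_valid_buffer(const bool *valid, size_t nbuffers,
					  size_t block_size, size_t page_size,
					  size_t first_buf_offset, size_t page_idx)
{
	size_t		page_start = page_idx * page_size;
	size_t		page_end = page_start + page_size;

	assert(block_size > 0 && page_size > 0);
	assert(first_buf_offset < page_size);

	for (size_t i = 0; i < nbuffers; i++)
	{
		size_t		buf_start = first_buf_offset + i * block_size;
		size_t		buf_end = buf_start + block_size;

		/* buffers are sorted by address, so nothing later can overlap */
		if (buf_start >= page_end)
			break;

		if (buf_end > page_start && valid[i])
			return true;
	}

	return false;
}
```

With 2MB huge pages and 8kB blocks a single page covers 256 buffers, so one valid buffer is enough to make the whole page worth probing. A false result here says nothing about whether the page is allocated, which is the ambiguity described above.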
I think we need to decide whether the current patches are good enough
for PG18, with the current behavior, and then maybe improve that in
PG19. Or whether this is so serious we have to leave all of it for PG19.
I'd go with the former, but perhaps I'm wrong. I don't feel like I want
to be reworking this less than a day before the feature freeze.
Attached is v27, which I planned to push, but I'll hold off.
regards
--
Tomas Vondra
Attachments:
v27-0001-Add-support-for-basic-NUMA-awareness.patch (text/x-patch; charset=UTF-8)
From 88a9afc48e5ddb081ed1209e56d5a038b02fc3bb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 7 Apr 2025 17:31:17 +0200
Subject: [PATCH v27 1/3] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with the NUMA library.
The main function introduced is pg_numa_query_pages(), which allows
determining NUMA node for individual memory pages. Internally the
function uses move_pages(2) syscall, as it allows batching, and is
more efficient than get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 22 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 443 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 8f4a5ab28ec..0936010718d 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index fc5f7475d07..2a78cddd825 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0224f93733d..9ab070adffb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..077bcc20759 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,17 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname>
+ library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2656,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname>
+ library is implemented. The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 27717ad8976..a1516e54529 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: false)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3279,6 +3300,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3935,6 +3957,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f596fda568c..d54df555fba 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5d5be8ba4e1..04834d130f9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8542,6 +8542,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9891b9b05c3..1af0b6316dd 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -689,6 +689,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..7e990d9f776
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(volatile uint64 *) ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 48d2dfb7cf3..fc7b059fee5 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * the future support for other platforms, e.g. Win32 or FreeBSD, might be
+ * possible too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use the move_pages(2) syscall here, instead of get_mempolicy(2), as the
+ * former lets us query many memory pages in a single batched system call,
+ * which is much faster than one call per page.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.49.0
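
To illustrate what a caller of pg_numa_query_pages() gets back, here is a minimal sketch of interpreting the status[] array that move_pages(2) fills in query mode (nodes == NULL). Per the man page, a non-negative entry is the NUMA node the page resides on and a negative entry is a negated errno, with -ENOENT meaning the page is not present (not yet faulted in). The PageState enum and both function names are hypothetical helpers for illustration only, not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical result categories; not part of the patch. */
typedef enum
{
	PAGE_ON_NODE,		/* status is a valid NUMA node id */
	PAGE_NOT_PRESENT,	/* -ENOENT: page has not been faulted in yet */
	PAGE_ERROR			/* other negative errno, or node id out of range */
} PageState;

/*
 * Interpret one entry of the status[] array filled by move_pages(2) in
 * query mode.  max_node is the highest valid node id, e.g. the value
 * returned by numa_max_node().
 */
PageState
classify_page_status(int status, int max_node)
{
	if (status >= 0 && status <= max_node)
		return PAGE_ON_NODE;
	if (status == -ENOENT)
		return PAGE_NOT_PRESENT;
	return PAGE_ERROR;
}

/* Count how many entries of status[] fall into the wanted state. */
size_t
count_pages_in_state(const int *status, size_t count, int max_node,
					 PageState wanted)
{
	size_t		n = 0;

	for (size_t i = 0; i < count; i++)
	{
		if (classify_page_status(status[i], max_node) == wanted)
			n++;
	}
	return n;
}
```

A non-zero count of PAGE_NOT_PRESENT entries would suggest that the pages were not touched before the inquiry, which is the situation pg_numa_touch_mem_if_required() is meant to avoid.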
v27-0002-Introduce-pg_shmem_allocations_numa-view.patch (text/x-patch; charset=UTF-8)
From 6c45aeec2ed7ac527a9b31c39e560df02f803f90 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 7 Apr 2025 17:52:53 +0200
Subject: [PATCH v27 2/3] Introduce pg_shmem_allocations_numa view
Introduce a new pg_shmem_allocations_numa view with information about how
shared memory is distributed across NUMA nodes. For each shared memory
segment, the view returns one row for each NUMA node backing it, with
the total amount of memory allocated from that node.
The view may be relatively expensive, especially when executed for the
first time in a backend, as it has to touch all the memory pages to get
reliable information about the NUMA node where the page resides. This
may in turn force allocation of the shared memory.
Unlike pg_shmem_allocations, the view does not show anonymous shared
memory allocations. It also does not show memory allocated using the
dynamic shared memory infrastructure.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 85 ++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 159 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 13 ++
src/test/regress/expected/numa_1.out | 5 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 10 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 311 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..4e853885de6 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,86 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />. This view will output multiple rows
+ for each of the shared memory segments, provided that they are spread across
+ multiple NUMA nodes. This view should not be queried by monitoring systems,
+ as it is very slow and may end up allocating shared memory in case it was not
+ used earlier.
+ A current limitation of this view is that it does not show anonymous shared
+ memory allocations.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>node_id</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..e10b380e5c7 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,159 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/*
+ * SQL SRF showing NUMA memory nodes for allocated shared memory
+ *
+ * Compared to pg_get_shmem_allocations(), this function does not return
+ * information about shared anonymous allocations and unused shared memory.
+ */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * Add 1, because we don't know how exactly the segments align to OS
+ * pages, so the allocation might use one more memory page. In practice
+ * this is not very likely, and moreover we have more entries, each of
+ * them using only fraction of the total pages.
+ */
+ shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+ char *startptr,
+ *endptr;
+ Size total_len;
+
+ /*
+ * Calculate the range of OS pages used by this segment. The segment
+ * may start / end half-way through a page, we want to count these
+ * pages too. So we align the start/end pointers down/up, and then
+ * calculate the number of pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, ent->location);
+ endptr = (char *) TYPEALIGN(os_page_size,
+ (char *) ent->location + ent->allocated_size);
+ total_len = (endptr - startptr);
+
+ shm_ent_page_count = total_len / os_page_size;
+
+ /*
+ * If we ever get 0xff (-1) back from kernel inquiry, then we probably
+ * have a bug in mapping buffers to OS pages.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ /*
+ * Setup page_ptrs[] with pointers to all OS pages for this segment,
+ * and get the NUMA status using pg_numa_query_pages.
+ *
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (ENOENT, which indicates unmapped/unallocated pages).
+ */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = startptr + (i * os_page_size);
+
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ /* Count the OS pages on each NUMA node for this shared memory entry */
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s < 0 || s > max_nodes)
+ {
+ elog(ERROR, "invalid NUMA node id outside of allowed range "
+ "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ nodes[s]++;
+ }
+
+ /*
+ * Add one entry for each NUMA node, including those without allocated
+ * memory for this segment.
+ */
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 04834d130f9..653258fd100 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8546,6 +8546,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,node_id,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..8af5dfeb9a5
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,13 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..c90042fa7cc
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,5 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+ERROR: libnuma initialization failed or NUMA is not supported on this platform
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 1fddb13b6ae..c25062c288f 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3219,8 +3219,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3242,6 +3242,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3261,6 +3267,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..abfdc97abc5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ node_id,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, node_id, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..324481c33b7
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,10 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 85d7280f35f..f337aa67c13 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1947,8 +1947,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1958,12 +1958,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.49.0
v27-0003-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
From fadeef67e14cc0db792a0ef81e8c0425eeb5b860 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 3 Apr 2025 20:21:25 +0200
Subject: [PATCH v27 3/3] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 5 +-
.../expected/pg_buffercache_numa.out | 29 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 22 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 283 ++++++++++++++++++
.../sql/pg_buffercache_numa.sql | 21 ++
doc/src/sgml/pgbuffercache.sgml | 75 ++++-
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 440 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..a10b331a552
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,29 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..998289790b7
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, ospageid int4, nodeid int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..03fc6574a52 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -20,6 +21,8 @@
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
+#define NUM_BUFFERCACHE_NUMA_ELEM 3
+
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
.version = PG_VERSION
@@ -58,16 +61,44 @@ typedef struct
BufferCachePagesRec *record;
} BufferCachePagesContext;
+/*
+ * Record structure holding the to be exposed cache data.
+ */
+typedef struct
+{
+ uint32 bufferid;
+ int32 numa_page;
+ int32 numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+ TupleDesc tupdesc;
+ int buffers_per_page;
+ int pages_per_buffer;
+ int os_page_size;
+ BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
/*
* Function returning data from the shared buffer cache - buffer number,
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -246,6 +277,258 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * Inquire about NUMA memory mappings for shared buffers.
+ *
+ * Returns NUMA node ID for each memory page used by the buffer. Buffers may
+ * be smaller or larger than OS memory pages. For each buffer we return one
+ * entry for each memory page used by the buffer (if the buffer is smaller,
+ * it only uses a part of one memory page).
+ *
+ * We expect both sizes (for buffers and memory pages) to be a power-of-2, so
+ * one is always a multiple of the other.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ BufferCacheNumaContext *fctx; /* User function context. */
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ idx;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_page_status;
+ uint64 os_page_count;
+ int pages_per_buffer;
+ int max_entries;
+ volatile uint64 touch pg_attribute_unused();
+ char *startptr,
+ *endptr;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ /*
+ * The database block size and OS memory page size are unlikely to be
+ * the same. The block size is 1-32KB, the memory page size depends on
+ * platform. On x86 it's usually 4KB, on ARM it's 4KB or 64KB, but
+ * there are also features like THP etc. Moreover, we don't quite know
+ * how the pages and buffers "align" in memory - the buffers may be
+ * shifted in some way, using more memory pages than necessary.
+ *
+ * So we need to be careful about mapping buffers to memory pages. We
+ * calculate the maximum number of pages a buffer might use, so that
+ * we allocate enough space for the entries. And then we count the
+ * actual number of entries as we scan the buffers.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * The page and block sizes are expected to be 2^k, so one divides the
+ * other (we don't know in which direction). This does not say
+ * anything about relative alignment of pages/buffers.
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * How many addresses we are going to query? Simply get the page for
+ * the first buffer, and first page after the last buffer, and count
+ * the pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size,
+ BufferGetBlock(1));
+ endptr = (char *) TYPEALIGN_DOWN(os_page_size,
+ (char *) BufferGetBlock(NBuffers) + BLCKSZ);
+ os_page_count = (endptr - startptr) / os_page_size;
+
+ /* Used to determine the NUMA node for all OS pages at once */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(int) * os_page_count);
+
+ /* Fill pointers for all the memory pages. */
+ idx = 0;
+ for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
+ {
+ os_page_ptrs[idx++] = ptr;
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, ptr);
+ }
+
+ Assert(idx == os_page_count);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_count=" UINT64_FORMAT " "
+ "os_page_size=%zu", NBuffers, os_page_count, os_page_size);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ */
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+ /* Query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /* Initialize the multi-call context, load entries about buffers */
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "ospageid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "nodeid",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /*
+ * Each buffer needs at least one entry, but it might be offset in
+ * some way, and use one extra entry. So we allocate space for the
+ * maximum number of entries we might need, and then count the exact
+ * number as we're walking buffers. That way we can do it in one pass,
+ * without reallocating memory.
+ */
+ pages_per_buffer = Max(1, BLCKSZ / os_page_size) + 1;
+ max_entries = NBuffers * pages_per_buffer;
+
+ /* Allocate entries for BufferCacheNumaRec records. */
+ fctx->record = (BufferCacheNumaRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCacheNumaRec) * max_entries);
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+ int32 ospageid;
+ char *startptr_buff,
+ *endptr_buff;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+ UnlockBufHdr(bufHdr, buf_state);
+
+ /* start of the first page of this buffer */
+ startptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr);
+
+ /* start of the page right after this buffer */
+ endptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr + BLCKSZ);
+
+ /* calculate ID of the first page for this buffer */
+ ospageid = (startptr_buff - startptr) / os_page_size;
+
+ /* Add an entry for each OS page overlapping with this buffer. */
+ for (char *ptr = startptr_buff; ptr < endptr_buff; ptr += os_page_size)
+ {
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].numa_page = ospageid;
+ fctx->record[idx].numa_node = os_page_status[ospageid];
+
+ /* advance to the next entry/page */
+ ++idx;
+ ++ospageid;
+ }
+ }
+
+ Assert((idx >= os_page_count) && (idx <= max_entries));
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = idx;
+ funcctx->user_fctx = fctx;
+
+ /* Remember this backend touched the pages */
+ firstNumaTouch = false;
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+ Datum values[NUM_BUFFERCACHE_NUMA_ELEM];
+ bool nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(fctx->record[i].numa_page);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(fctx->record[i].numa_node);
+ nulls[2] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..837f3d64e21
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,21 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..b39c9849362 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,15 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+ information is not part of <function>pg_buffercache_pages()</function>
+ itself, as it is much slower to retrieve.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +211,68 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>ospageid</structfield> <type>int</type>
+ </para>
+ <para>
+ number of OS memory page for this buffer
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>nodeid</structfield> <type>int</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all the cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d42b943ef94..f7ba0ec809e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -341,6 +341,8 @@ BufFile
Buffer
BufferAccessStrategy
BufferAccessStrategyType
+BufferCacheNumaRec
+BufferCacheNumaContext
BufferCachePagesContext
BufferCachePagesRec
BufferDesc
--
2.49.0
Hi,
On 2025-04-07 18:36:24 +0200, Tomas Vondra wrote:
Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
itself wouldn't be too bad - in fact I'd rather like to have an explicit way
of doing that. The problem is that that leads to all those allocations to
happen on the *current* numa node (unless you have started postgres with
numactl --interleave=all or such), rather than the node where the normal first
use would have allocated it.
I agree, forcing those allocations to happen on a single node seems
rather unfortunate. But really, how likely is it that someone will run
this function on a cluster that hasn't already allocated this memory?
I think it's not at all unlikely to have parts of shared buffers unused at the
start of a benchmark, e.g. because the table sizes grow over time.
I'm not saying it can't happen, but we already have this issue if you
start and do a warmup from a single connection ...
Indeed! We really need to fix this...
It's just that we don't have the memory mapped in the current backend, so
I'd bet people would not be happy with NULL, and would proceed to force the
allocation in some other way (say, a large query of some sort). Which
obviously causes a lot of other problems.
I don't think that really would be the case with what I proposed? If any
buffer in the region were valid, we would force the allocation to become known
to the current backend.
It's not quite clear to me what exactly are you proposing :-(
I believe you're referring to this:
The only allocation where that really matters is shared_buffers. I wonder if
we could special case the logic for that, by only probing if at least one of
the buffers in the range is valid.
Then we could treat a page status of -ENOENT as "page is not mapped" and
display NULL for the node_id?
Of course that would mean that we'd always need to
pg_numa_touch_mem_if_required(), not just the first time round, because we
previously might not have for a page that is now valid. But compared to the
cost of actually allocating pages, the cost for that seems small.
I suppose by "range" you mean buffers on a given memory page
Correct.
and "valid" means BufferIsValid.
I was thinking of checking if the BufferDesc indicates BM_VALID or
BM_TAG_VALID.
BufferIsValid() just does a range check :(.
Yeah, that probably means the memory page is allocated. But if the buffer is
invalid, it does not mean the memory is not allocated, right? So does it
make the buffer not interesting?
Well, if you don't have contents in it, it can't really affect performance. But
yea, I agree, it's not perfect either.
I think we need to decide whether the current patches are good enough
for PG18, with the current behavior, and then maybe improve that in
PG19.
I think as long as the docs mention this with <note> or <warning> it's ok for
now.
Greetings,
Andres Freund
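To make the special-casing idea above a bit more concrete, here is a rough, standalone sketch of the "only probe a memory page if at least one buffer on it is valid" check. It is not code from the patch. The helper name page_has_valid_buffer(), the buf_valid[] array (standing in for a BM_VALID / BM_TAG_VALID test on each BufferDesc) and the first_buf_off parameter (offset of the first buffer from the start of OS page 0) are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: report whether OS page
 * "page" overlaps at least one buffer flagged as valid.
 *
 * Layout assumption: OS page 0 starts at offset 0, and buffer i occupies
 * bytes [first_buf_off + i * blcksz, first_buf_off + (i + 1) * blcksz).
 */
static bool
page_has_valid_buffer(size_t page, size_t os_page_size, size_t blcksz,
                      size_t first_buf_off,
                      const bool *buf_valid, size_t nbuffers)
{
    size_t      page_start = page * os_page_size;
    size_t      page_end = page_start + os_page_size;   /* exclusive */
    size_t      first;
    size_t      last;

    assert(os_page_size > 0 && blcksz > 0);

    /* No buffers at all, or the page ends before the first buffer starts. */
    if (nbuffers == 0 || page_end <= first_buf_off)
        return false;

    /* Index of the first buffer that can overlap this page. */
    first = (page_start > first_buf_off)
        ? (page_start - first_buf_off) / blcksz
        : 0;
    if (first >= nbuffers)
        return false;           /* page lies past the last buffer */

    /* Index of the last buffer that can overlap this page, clamped. */
    last = (page_end - 1 - first_buf_off) / blcksz;
    if (last >= nbuffers)
        last = nbuffers - 1;

    for (size_t i = first; i <= last; i++)
    {
        if (buf_valid[i])
            return true;
    }

    return false;
}
```

With something like this, a page for which the helper returns false would be skipped (no touch), and a status of -ENOENT for it would be shown as NULL. With 8kB buffers on 4kB pages, each page maps to exactly one buffer. With 2MB huge pages, one page covers 256 buffers and a single valid buffer is enough to probe it.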
On 4/7/25 18:42, Andres Freund wrote:
...
Of course that would mean that we'd always need to
pg_numa_touch_mem_if_required(), not just the first time round, because we
previously might not have for a page that is now valid. But compared to the
cost of actually allocating pages, the cost for that seems small.
I suppose by "range" you mean buffers on a given memory page
Correct.
and "valid" means BufferIsValid.
I was thinking of checking if the BufferDesc indicates BM_VALID or
BM_TAG_VALID.
BufferIsValid() just does a range check :(.
Well, I guess BufferIsValid() seems a tad confusing ...
Yeah, that probably means the memory page is allocated. But if the buffer is
invalid, it does not mean the memory is not allocated, right? So does it
make the buffer not interesting?
Well, if you don't have contents in it, it can't really affect performance. But
yea, I agree, it's not perfect either.
I think we need to decide whether the current patches are good enough
for PG18, with the current behavior, and then maybe improve that in
PG19.
I think as long as the docs mention this with <note> or <warning> it's ok for
now.
OK, I'll add a warning explaining this.
regards
--
Tomas Vondra
On 2025-04-04 19:07:12 +0200, Jakub Wartak wrote:
They actually look good to me. We've discussed earlier dropping
s/numa_//g for column names (after all views contain it already) so
they are fine in this regard.
There's also the question of consistency: (bufferid, page_num,
node_id) -- maybe should just drop "_" and that's it?
Well I would even possibly consider page_num -> ospagenumber, but that's ugly.
I'd go for os_page_num.
On 4/7/25 19:24, Andres Freund wrote:
On 2025-04-04 19:07:12 +0200, Jakub Wartak wrote:
They actually look good to me. We've discussed earlier dropping
s/numa_//g for column names (after all views contain it already) so
they are fine in this regard.
There's also the question of consistency: (bufferid, page_num,
node_id) -- maybe should just drop "_" and that's it?
Well I would even possibly consider page_num -> ospagenumber, but that's ugly.
I'd go for os_page_num.
WFM. I've renamed "ospageid" to "os_page_num" in 0003, and I've also
renamed "node_id" to "numa_node" in 0002+0003, to make it clearer what
kind of node this is.
This reminds me whether it's fine to have "os_page_num" as int. Should
we make it bigint, perhaps?
Attached is v28, with the commit messages updated, added <warning> about
allocation of the memory, etc. I'll let the CI run the tests on it, and
then will push, unless someone has more comments.
regards
--
Tomas Vondra
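On the int vs. bigint question for os_page_num, a quick back-of-the-envelope check may help. The sketch below is not from the patch and both function names are made up for illustration. It computes how large a memory region can get before 0-based page numbers no longer fit in a signed 32-bit integer, ignoring the one extra page that misalignment of the first buffer might add.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: largest region (in bytes) whose 0-based OS page
 * numbers all fit in an int32, i.e. page numbers 0 .. INT32_MAX.
 */
static uint64_t
max_region_for_int32_pages(uint64_t os_page_size)
{
    /* 2^31 distinct page numbers are available */
    return ((uint64_t) INT32_MAX + 1) * os_page_size;
}

/*
 * Hypothetical helper: would every page of a region of region_bytes get a
 * page number representable as int32?
 */
static bool
page_numbers_fit_int32(uint64_t region_bytes, uint64_t os_page_size)
{
    uint64_t    npages;

    assert(os_page_size > 0);

    /* round up to whole pages */
    npages = region_bytes / os_page_size + (region_bytes % os_page_size != 0);

    return npages <= (uint64_t) INT32_MAX + 1;
}
```

For 4kB pages the limit works out to 2^31 * 4096 bytes = 8 TiB of shared buffers, and it is proportionally higher with huge pages. So int only becomes a problem for very large shared_buffers on small pages; whether that is comfortable enough is a judgment call, and bigint would remove the question entirely.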
Attachments:
v28-0001-Add-support-for-basic-NUMA-awareness.patch
From 9a222c77de2ee4a0b32d97c3d8bab2bb33f066de Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 7 Apr 2025 17:31:17 +0200
Subject: [PATCH v28 1/3] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
The main function introduced is pg_numa_query_pages(), which allows
determining NUMA node for individual memory pages. Internally the
function uses move_pages(2) syscall, as it allows batching, and is
more efficient than get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 22 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 443 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 8f4a5ab28ec..0936010718d 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index fc5f7475d07..2a78cddd825 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to 1 to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0224f93733d..9ab070adffb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..077bcc20759 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,17 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname>
+ library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2656,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname>
+ library is implemented. The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 27717ad8976..a1516e54529 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: false)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3279,6 +3300,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3935,6 +3957,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f596fda568c..d54df555fba 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5d5be8ba4e1..04834d130f9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8542,6 +8542,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9891b9b05c3..1af0b6316dd 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -689,6 +689,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..7e990d9f776
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ ro_volatile_var = *(volatile uint64 *) ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 48d2dfb7cf3..fc7b059fee5 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use the move_pages(2) syscall here, rather than get_mempolicy(2), because
+ * it lets us query many memory pages in a single batched system call, which is
+ * much faster than one call per page.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.49.0
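As a side note on the "touch before query" requirement described in pg_numa.h above, here is a minimal standalone sketch, not part of the patch. It only illustrates faulting pages in with a volatile read, one read per OS page. The names sketch_os_page_size() and sketch_touch_pages() are hypothetical, the 4096-byte fallback is an assumption, and huge pages are ignored. The actual NUMA query (numa_move_pages() in the patch) is left out because it needs Linux with libnuma.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: base OS page size as reported by sysconf().
 * Huge pages are ignored here; the patch handles them separately.
 */
static size_t
sketch_os_page_size(void)
{
	long		sz = sysconf(_SC_PAGESIZE);

	/* The 4096 fallback is an assumption made for this sketch only. */
	return (sz > 0) ? (size_t) sz : 4096;
}

/*
 * Hypothetical helper: read one byte at every page_size stride in
 * [start, start + len), so that each page is faulted in before any NUMA
 * node query. "start" is assumed to be page-aligned. Returns the number
 * of reads performed, i.e. the number of pages touched.
 */
static size_t
sketch_touch_pages(const void *start, size_t len, size_t page_size)
{
	const volatile unsigned char *p = start;
	size_t		touched = 0;

	assert(page_size > 0);

	for (size_t off = 0; off < len; off += page_size)
	{
		(void) p[off];			/* volatile read forces the access */
		touched++;
	}

	return touched;
}
```

The volatile qualifier plays the same role as the ro_volatile_var in the macro: it keeps the compiler from optimizing the otherwise unused read away.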
v28-0002-Introduce-pg_shmem_allocations_numa-view.patch
From 231b370dbbfc2b311637e9076e6d4850a11951ae Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 7 Apr 2025 19:32:39 +0200
Subject: [PATCH v28 2/3] Introduce pg_shmem_allocations_numa view
Introduce new pg_shmem_allocations_numa view with information about how
shared memory is distributed across NUMA nodes. For each shared memory
segment, the view returns one row for each NUMA node backing it, with
the total amount of memory allocated from that node.
The view may be relatively expensive, especially when executed for the
first time in a backend, as it has to touch all memory pages to get
reliable information about the NUMA node. This may also force allocation
of the shared memory.
Unlike pg_shmem_allocations, the view does not show anonymous shared
memory allocations. It also does not show memory allocated using the
dynamic shared memory infrastructure.
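To make the per-segment accounting described above concrete, here is a rough standalone sketch (not the patch code) of the two steps involved: working out how many OS pages a segment overlaps, and turning a per-page node status array into per-node byte totals. The function names are hypothetical, the page size is assumed to be a power of two, and the status array is assumed to hold one node id per page, with negative values meaning an error for that page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of OS pages overlapped by [addr, addr + len).
 * The start is aligned down and the end aligned up, so partially covered
 * pages at either edge are counted as well.
 */
static size_t
sketch_pages_spanned(uintptr_t addr, size_t len, size_t page_size)
{
	uintptr_t	mask = (uintptr_t) page_size - 1;
	uintptr_t	start = addr & ~mask;				/* align down */
	uintptr_t	end = (addr + len + mask) & ~mask;	/* align up */

	/* page_size must be a nonzero power of two */
	assert(page_size != 0 && (page_size & mask) == 0);

	return (size_t) ((end - start) / page_size);
}

/*
 * Hypothetical helper: sum up bytes per NUMA node from a per-page status
 * array. bytes[] must have room for max_node + 1 entries. Returns 0 on
 * success, or -1 if any page reports a node outside [0, max_node].
 */
static int
sketch_bytes_per_node(const int *status, size_t npages, size_t page_size,
					  size_t *bytes, int max_node)
{
	for (int n = 0; n <= max_node; n++)
		bytes[n] = 0;

	for (size_t i = 0; i < npages; i++)
	{
		if (status[i] < 0 || status[i] > max_node)
			return -1;
		bytes[status[i]] += page_size;
	}

	return 0;
}
```

Because edge pages are counted in full, the per-node sizes for one segment can add up to more than the segment's allocated size, and neighbouring segments may both count the same OS page.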
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 95 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 159 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 13 ++
src/test/regress/expected/numa_1.out | 5 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 10 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 321 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..0eba37268bf 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,96 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />. This view will output multiple rows
+ for each of the shared memory segments provided that they are spread across
+ multiple NUMA nodes. This view should not be queried by monitoring systems
+ as it is very slow and may end up allocating shared memory in case it was not
+ used earlier.
+ A current limitation of this view is that it does not show anonymous shared memory
+ allocations.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <warning>
+ <para>
+ When determining the <acronym>NUMA</acronym> node, the view touches
+ all memory pages for the shared memory segment. This will force
+ allocation of the shared memory, if it wasn't allocated already,
+ and the memory may get allocated in a single <acronym>NUMA</acronym>
+ node (depending on system configuration).
+ </para>
+ </warning>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_node</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..e10b380e5c7 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,159 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/*
+ * SQL SRF showing NUMA memory nodes for allocated shared memory
+ *
+ * Compared to pg_get_shmem_allocations(), this function does not return
+ * information about shared anonymous allocations and unused shared memory.
+ */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * Add 1, because we don't know how exactly the segments align to OS
+ * pages, so the allocation might use one more memory page. In practice
+ * this is not very likely, and moreover we have more entries, each of
+ * them using only fraction of the total pages.
+ */
+ shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+ char *startptr,
+ *endptr;
+ Size total_len;
+
+ /*
+ * Calculate the range of OS pages used by this segment. The segment
+ * may start / end half-way through a page, we want to count these
+ * pages too. So we align the start/end pointers down/up, and then
+ * calculate the number of pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, ent->location);
+ endptr = (char *) TYPEALIGN(os_page_size,
+ (char *) ent->location + ent->allocated_size);
+ total_len = (endptr - startptr);
+
+ shm_ent_page_count = total_len / os_page_size;
+
+ /*
+ * If we ever get 0xff (-1) back from kernel inquiry, then we probably
+ * have a bug in mapping buffers to OS pages.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ /*
+ * Setup page_ptrs[] with pointers to all OS pages for this segment,
+ * and get the NUMA status using pg_numa_query_pages.
+ *
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (ENOENT, which indicates unmapped/unallocated pages).
+ */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = startptr + (i * os_page_size);
+
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ /* Count the pages on each NUMA node for this shared memory entry */
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s < 0 || s > max_nodes)
+ {
+ elog(ERROR, "invalid NUMA node id outside of allowed range "
+ "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ nodes[s]++;
+ }
+
+ /*
+ * Add one entry for each NUMA node, including those without allocated
+ * memory for this segment.
+ */
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 04834d130f9..8597981d6b3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8546,6 +8546,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_node,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..8af5dfeb9a5
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,13 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..c90042fa7cc
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,5 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+ERROR: libnuma initialization failed or NUMA is not supported on this platform
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 1fddb13b6ae..c25062c288f 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3219,8 +3219,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3242,6 +3242,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3261,6 +3267,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..6cf828ca8d0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ numa_node,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, numa_node, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..324481c33b7
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,10 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 85d7280f35f..f337aa67c13 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1947,8 +1947,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1958,12 +1958,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.49.0
v28-0003-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch (text/x-patch; charset=UTF-8)
From 9dd48a22119f0d5bc8c88641a91e8bed37eea566 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 7 Apr 2025 19:39:31 +0200
Subject: [PATCH v28 3/3] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing NUMA memory nodes
for individual buffers. For each buffer the view returns an entry for
each memory page, with the associated NUMA node.
The database blocks and OS memory pages may have different sizes - the
default block size is 8KB, while the memory page is 4KB (on x86). But
other combinations are possible, depending on configure parameters,
platform, etc. This means buffers may overlap with multiple memory
pages, each associated with a different NUMA node.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The view may be relatively expensive, especially when accessed for the
first time in a backend, as it touches all memory pages to get reliable
information about the NUMA node. This may also force allocation of the
shared memory.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 5 +-
.../expected/pg_buffercache_numa.out | 29 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 22 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 283 ++++++++++++++++++
.../sql/pg_buffercache_numa.sql | 21 ++
doc/src/sgml/pgbuffercache.sgml | 85 +++++-
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 450 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..a10b331a552
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,29 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..f6668e41b37
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, os_page_num int4, numa_node int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..b9fdf87bcc4 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -20,6 +21,8 @@
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
+#define NUM_BUFFERCACHE_NUMA_ELEM 3
+
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
.version = PG_VERSION
@@ -58,16 +61,44 @@ typedef struct
BufferCachePagesRec *record;
} BufferCachePagesContext;
+/*
+ * Record structure holding the to be exposed cache data.
+ */
+typedef struct
+{
+ uint32 bufferid;
+ int32 page_num;
+ int32 numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+ TupleDesc tupdesc;
+ int buffers_per_page;
+ int pages_per_buffer;
+ int os_page_size;
+ BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
/*
* Function returning data from the shared buffer cache - buffer number,
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -246,6 +277,258 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * Inquire about NUMA memory mappings for shared buffers.
+ *
+ * Returns NUMA node ID for each memory page used by the buffer. Buffers may
+ * be smaller or larger than OS memory pages. For each buffer we return one
+ * entry for each memory page used by the buffer (if the buffer is smaller,
+ * it only uses a part of one memory page).
+ *
+ * We expect both sizes (for buffers and memory pages) to be a power-of-2, so
+ * one is always a multiple of the other.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ BufferCacheNumaContext *fctx; /* User function context. */
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ idx;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_page_status;
+ uint64 os_page_count;
+ int pages_per_buffer;
+ int max_entries;
+ volatile uint64 touch pg_attribute_unused();
+ char *startptr,
+ *endptr;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ /*
+ * The database block size and OS memory page size are unlikely to be
+ * the same. The block size is 1-32KB, the memory page size depends on
+ * platform. On x86 it's usually 4KB, on ARM it's 4KB or 64KB, but
+ * there are also features like THP etc. Moreover, we don't quite know
+ * how the pages and buffers "align" in memory - the buffers may be
+ * shifted in some way, using more memory pages than necessary.
+ *
+ * So we need to be careful about mapping buffers to memory pages. We
+ * calculate the maximum number of pages a buffer might use, so that
+ * we allocate enough space for the entries. And then we count the
+ * actual number of entries as we scan the buffers.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * The page and block sizes are expected to be 2^k, so one divides the
+ * other (we don't know in which direction). This does not say
+ * anything about relative alignment of pages/buffers.
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * How many addresses are we going to query? Simply get the page for
+ * the first buffer, and first page after the last buffer, and count
+ * the pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size,
+ BufferGetBlock(1));
+ endptr = (char *) TYPEALIGN_DOWN(os_page_size,
+ (char *) BufferGetBlock(NBuffers) + BLCKSZ);
+ os_page_count = (endptr - startptr) / os_page_size;
+
+ /* Used to determine the NUMA node for all OS pages at once */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(uint64) * os_page_count);
+
+ /* Fill pointers for all the memory pages. */
+ idx = 0;
+ for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
+ {
+ os_page_ptrs[idx++] = ptr;
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, ptr);
+ }
+
+ Assert(idx == os_page_count);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_count=" UINT64_FORMAT " "
+ "os_page_size=%zu", NBuffers, os_page_count, os_page_size);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ */
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+ /* Query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /* Initialize the multi-call context, load entries about buffers */
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "os_page_num",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "numa_node",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /*
+ * Each buffer needs at least one entry, but it might be offset in
+ * some way, and use one extra entry. So we allocate space for the
+ * maximum number of entries we might need, and then count the exact
+ * number as we're walking buffers. That way we can do it in one pass,
+ * without reallocating memory.
+ */
+ pages_per_buffer = Max(1, BLCKSZ / os_page_size) + 1;
+ max_entries = NBuffers * pages_per_buffer;
+
+ /* Allocate entries for BufferCacheNumaRec records. */
+ fctx->record = (BufferCacheNumaRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCacheNumaRec) * max_entries);
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+ int32 page_num;
+ char *startptr_buff,
+ *endptr_buff;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+ UnlockBufHdr(bufHdr, buf_state);
+
+ /* start of the first page of this buffer */
+ startptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr);
+
+ /* start of the page right after this buffer */
+ endptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr + BLCKSZ);
+
+ /* calculate ID of the first page for this buffer */
+ page_num = (startptr_buff - startptr) / os_page_size;
+
+ /* Add an entry for each OS page overlapping with this buffer. */
+ for (char *ptr = startptr_buff; ptr < endptr_buff; ptr += os_page_size)
+ {
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].page_num = page_num;
+ fctx->record[idx].numa_node = os_page_status[page_num];
+
+ /* advance to the next entry/page */
+ ++idx;
+ ++page_num;
+ }
+ }
+
+ Assert((idx >= os_page_count) && (idx <= max_entries));
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = idx;
+ funcctx->user_fctx = fctx;
+
+ /* Remember this backend touched the pages */
+ firstNumaTouch = false;
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+ Datum values[NUM_BUFFERCACHE_NUMA_ELEM];
+ bool nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(fctx->record[i].page_num);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(fctx->record[i].numa_node);
+ nulls[2] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..837f3d64e21
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,21 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..b5050cd7343 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ the <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,15 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+ information is not part of <function>pg_buffercache_pages()</function>
+ itself, as it is much slower to retrieve.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +211,78 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>os_page_num</structfield> <type>int</type>
+ </para>
+ <para>
+ number of the OS memory page for this buffer
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_node</structfield> <type>int</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ <warning>
+ <para>
+ When determining the <acronym>NUMA</acronym> node, the view touches
+ all memory pages for the shared memory segment. This will force
+ allocation of the shared memory, if it wasn't allocated already,
+ and the memory may get allocated in a single <acronym>NUMA</acronym>
+ node (depending on system configuration).
+ </para>
+ </warning>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d42b943ef94..f7ba0ec809e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -341,6 +341,8 @@ BufFile
Buffer
BufferAccessStrategy
BufferAccessStrategyType
+BufferCacheNumaRec
+BufferCacheNumaContext
BufferCachePagesContext
BufferCachePagesRec
BufferDesc
--
2.49.0
Hi,
On 2025-04-07 19:59:59 +0200, Tomas Vondra wrote:
This reminds me whether it's fine to have "os_page_num" as int. Should
we make it bigint, perhaps?
Yes, that's better. Seems very unlikely anybody will encounter this in the
next few years, but it's basically free to use the larger range here.
Greetings,
Andres Freund
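To put a rough number on the os_page_num range discussed above, here is a
small sketch (not part of the patch; the helper name is made up for
illustration). It computes how much memory can be addressed before a page
number stops fitting into a signed 32-bit integer, assuming page numbers
start at 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: size in bytes of the largest memory range whose
 * page numbers (0 .. INT32_MAX) still fit into an int32.
 */
static uint64_t
max_bytes_for_int32_page_num(uint64_t os_page_size)
{
    assert(os_page_size > 0);

    /* 2^31 distinct page numbers are representable */
    return ((uint64_t) INT32_MAX + 1) * os_page_size;
}
```

With 4KB pages this works out to 8 TiB, which is why the limit is unlikely
to be hit soon, while switching to bigint costs essentially nothing.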
Hi,
On Mon, Apr 07, 2025 at 12:42:21PM -0400, Andres Freund wrote:
Hi,
On 2025-04-07 18:36:24 +0200, Tomas Vondra wrote:
I was thinking of checking if the BufferDesc indicates BM_VALID or
BM_TAG_VALID.
Yeah, that's what I did propose in [1] (when we were speaking about get_mempolicy())
and I think that would make sense as future improvement.
I think we need to decide whether the current patches are good enough
for PG18, with the current behavior, and then maybe improve that in
PG19.
I think as long as the docs mention this with <note> or <warning> it's ok for
now.
+1
A few comments on v27:
=== 1
pg_buffercache_numa() reports the node ID as "nodeid" while pg_shmem_allocations_numa()
reports it as node_id. Maybe we should use the same "naming" in both.
=== 2
postgres=# select count(*) from pg_buffercache;
count
-------
65536
(1 row)
but
postgres=# select count(*) from pg_buffercache_numa;
count
-------
64
(1 row)
with:
postgres=# show block_size;
block_size
------------
2048
and Hugepagesize: 2048 kB.
and
postgres=# show shared_buffers;
shared_buffers
----------------
128MB
(1 row)
And even if for testing I set:
- funcctx->max_calls = idx;
+ funcctx->max_calls = 65536;
then I start to see weird results:
postgres=# select count(*) from pg_buffercache_numa where bufferid not in (select bufferid from pg_buffercache);
count
-------
65472
(1 row)
So it looks like the new way of iterating over the buffers that was introduced
in v26/v27 has some issue?
[1]: /messages/by-id/Z64Pr8CTG0RTrGR3@ip-10-97-1-34.eu-west-3.compute.internal
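The count of 64 can be reproduced with a small standalone model of the v28
loop. This is only a sketch under assumptions: the function names are made up,
the buffer pool is modeled as a contiguous range of plain integer addresses,
and the base address (hypothetical) is assumed to be aligned to the huge page
size. Both ends of each buffer are rounded down to a page boundary, as in the
v28 code.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of align (align must be a power of two). */
static uintptr_t
align_down(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/*
 * Count the rows the v28 loop would emit. A buffer that lies strictly
 * inside one OS page gets start == end after rounding both ends down,
 * so its [start, end) range is empty and it produces no row at all.
 */
static uint64_t
count_entries_v28(uintptr_t base, uint64_t nbuffers,
                  uintptr_t blcksz, uintptr_t os_page_size)
{
    uint64_t    entries = 0;

    for (uint64_t i = 0; i < nbuffers; i++)
    {
        uintptr_t   buf = base + (uintptr_t) (i * blcksz);
        uintptr_t   start = align_down(buf, os_page_size);
        uintptr_t   end = align_down(buf + blcksz, os_page_size);

        entries += (end - start) / os_page_size;
    }

    return entries;
}
```

With 65536 buffers of 2kB and 2MB pages, only the last buffer in each huge
page (1 out of every 1024) ends exactly on a page boundary and gets a row,
which gives 65536 / 1024 = 64. With the default 8kB blocks and 4kB pages the
same logic returns two rows per buffer, so the problem stays hidden there.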
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 4/7/25 20:11, Bertrand Drouvot wrote:
Hi,
On Mon, Apr 07, 2025 at 12:42:21PM -0400, Andres Freund wrote:
Hi,
On 2025-04-07 18:36:24 +0200, Tomas Vondra wrote:
I was thinking of checking if the BufferDesc indicates BM_VALID or
BM_TAG_VALID.
Yeah, that's what I did propose in [1] (when we were speaking about get_mempolicy())
and I think that would make sense as future improvement.
I think we need to decide whether the current patches are good enough
for PG18, with the current behavior, and then maybe improve that in
PG19.
I think as long as the docs mention this with <note> or <warning> it's ok for
now.
+1
A few comments on v27:
=== 1
pg_buffercache_numa() reports the node ID as "nodeid" while pg_shmem_allocations_numa()
reports it as node_id. Maybe we should use the same "naming" in both.
This was renamed in v28 to "numa_node" in both parts.
=== 2
postgres=# select count(*) from pg_buffercache;
count
-------
65536
(1 row)
but
postgres=# select count(*) from pg_buffercache_numa;
count
-------
64
(1 row)
with:
postgres=# show block_size;
block_size
------------
2048
and Hugepagesize: 2048 kB.
and
postgres=# show shared_buffers;
shared_buffers
----------------
128MB
(1 row)
And even if for testing I set:
- funcctx->max_calls = idx;
+ funcctx->max_calls = 65536;
then I start to see weird results:
postgres=# select count(*) from pg_buffercache_numa where bufferid not in (select bufferid from pg_buffercache);
count
-------
65472
(1 row)
So it looks like the new way of iterating over the buffers that was introduced
in v26/v27 has some issue?
Yeah, the calculations of the end pointers were wrong - we need to round
up (using TYPEALIGN()) when calculating number of pages, and just add
BLCKSZ (without any rounding) when calculating end of buffer. The 0004
fixes this for me (I tried this with various blocksizes / page sizes).
Thanks for noticing this!
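The corrected arithmetic described above can be sketched as a standalone
model. This is not the code from 0004, just an illustration under
assumptions: the function names are made up, addresses are plain integers,
and the buffer pool is assumed to be contiguous. The total page count rounds
the end of the pool up, and each buffer's end is simply buf + BLCKSZ with no
rounding.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of align (align must be a power of two). */
static uintptr_t
align_down(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Round addr up to a multiple of align (align must be a power of two). */
static uintptr_t
align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Number of OS pages covering the whole pool: the end is rounded UP. */
static uint64_t
count_os_pages(uintptr_t base, uint64_t nbuffers,
               uintptr_t blcksz, uintptr_t os_page_size)
{
    uintptr_t   start = align_down(base, os_page_size);
    uintptr_t   end = align_up(base + (uintptr_t) (nbuffers * blcksz),
                               os_page_size);

    return (end - start) / os_page_size;
}

/* Rows emitted when the end of each buffer is NOT rounded. */
static uint64_t
count_entries_fixed(uintptr_t base, uint64_t nbuffers,
                    uintptr_t blcksz, uintptr_t os_page_size)
{
    uint64_t    entries = 0;

    for (uint64_t i = 0; i < nbuffers; i++)
    {
        uintptr_t   buf = base + (uintptr_t) (i * blcksz);
        uintptr_t   start = align_down(buf, os_page_size);
        uintptr_t   end = buf + blcksz;

        /* one row per OS page overlapping [buf, buf + blcksz) */
        for (uintptr_t ptr = start; ptr < end; ptr += os_page_size)
            entries++;
    }

    return entries;
}
```

For the 2kB block / 2MB page case this yields one row per buffer (65536) over
64 OS pages. For buffers shifted by 2kB relative to 4kB pages, an 8kB buffer
overlaps three pages, which matches the Max(1, BLCKSZ / os_page_size) + 1
upper bound used when sizing the result array.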
regards
--
Tomas Vondra
Attachments:
v29-0001-Add-support-for-basic-NUMA-awareness.patch (text/x-patch; charset=UTF-8)
From eac1dc9acbcbde8746d4fa48bc6f5647fd7dfa50 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 7 Apr 2025 17:31:17 +0200
Subject: [PATCH v29 1/4] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.
A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.
The main function introduced is pg_numa_query_pages(), which allows
determining NUMA node for individual memory pages. Internally the
function uses move_pages(2) syscall, as it allows batching, and is
more efficient than get_mempolicy(2).
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
.cirrus.tasks.yml | 2 +
configure | 187 ++++++++++++++++++++++++++++
configure.ac | 14 +++
doc/src/sgml/func.sgml | 13 ++
doc/src/sgml/installation.sgml | 22 ++++
meson.build | 23 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 6 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/catalog/pg_proc.dat | 4 +
src/include/pg_config.h.in | 3 +
src/include/port/pg_numa.h | 40 ++++++
src/include/storage/pg_shmem.h | 1 +
src/makefiles/meson.build | 3 +
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_numa.c | 120 ++++++++++++++++++
17 files changed, 443 insertions(+), 2 deletions(-)
create mode 100644 src/include/port/pg_numa.h
create mode 100644 src/port/pg_numa.c
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-libnuma \
--with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
-Dllvm=disabled \
--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
-DPERL=perl5.36-i386-linux-gnu \
+ -Dlibnuma=disabled \
build-32
EOF
diff --git a/configure b/configure
index 11615d1122d..e27badd83c3 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
XML2_CFLAGS
XML2_CONFIG
with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
@@ -872,6 +875,7 @@ with_liburing
with_uuid
with_ossp_uuid
with_libcurl
+with_libnuma
with_libxml
with_libxslt
with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
LIBURING_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
XML2_CONFIG
XML2_CFLAGS
XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libcurl build with libcurl support
+ --with-libnuma build with libnuma support
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
--with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
linker flags for LIBCURL, overriding pkg-config
+ LIBNUMA_CFLAGS
+ C compiler flags for LIBNUMA, overriding pkg-config
+ LIBNUMA_LIBS
+ linker flags for LIBNUMA, overriding pkg-config
XML2_CONFIG path to xml2-config utility
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
fi
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_available=yes
+else
+ ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+ pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+ pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+ else
+ LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBNUMA_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+ LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
#
# XML
#
diff --git a/configure.ac b/configure.ac
index debdf165044..d365a486d3d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
fi
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+ PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
#
# XML
#
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0224f93733d..9ab070adffb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_numa_available</primary>
+ </indexterm>
+ <function>pg_numa_available</function> ()
+ <returnvalue>boolean</returnvalue>
+ </para>
+ <para>
+ Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..077bcc20759 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,17 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libnuma">
+ <term><option>--with-libnuma</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname>
+ library is implemented.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-liburing">
<term><option>--with-liburing</option></term>
<listitem>
@@ -2645,6 +2656,17 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libnuma-meson">
+ <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libnuma to enable basic NUMA support.
+ Only supported on platforms for which the <productname>libnuma</productname>
+ library is implemented. The default for this option is auto.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libxml-meson">
<term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 454ed81f5ea..0a625047f33 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
endif
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+ # via pkg-config
+ libnuma = dependency('numa', required: false)
+ if not libnuma.found()
+ libnuma = cc.find_library('numa', required: libnumaopt)
+ endif
+ if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+ libnuma = not_found_dep
+ endif
+ if libnuma.found()
+ cdata.set('USE_LIBNUMA', 1)
+ endif
+else
+ libnuma = not_found_dep
+endif
+
###############################################################
# Library: liburing
@@ -3243,6 +3264,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ libnuma,
liburing,
libxml,
lz4,
@@ -3899,6 +3921,7 @@ if meson.version().version_compare('>=0.57')
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('libnuma', type: 'feature', value: 'auto',
+ description: 'NUMA support')
+
option('liburing', type : 'feature', value: 'auto',
description: 'io_uring support, for asynchronous I/O')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
with_libcurl = @with_libcurl@
+with_libnuma = @with_libnuma@
with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBNUMA_CFLAGS = @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS = @LIBNUMA_LIBS@
+
LIBURING_CFLAGS = @LIBURING_CFLAGS@
LIBURING_LIBS = @LIBURING_LIBS@
@@ -250,7 +254,7 @@ CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
PG_SYSROOT = @PG_SYSROOT@
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;
/*
* These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5d5be8ba4e1..04834d130f9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8542,6 +8542,10 @@
proargnames => '{name,off,size,allocated_size}',
prosrc => 'pg_get_shmem_allocations' },
+{ oid => '9685', descr => 'Is NUMA support available?',
+ proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+ proargtypes => '', prosrc => 'pg_numa_available' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c2f1241b234..b3166ec8f42 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -686,6 +686,9 @@
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
/* Define to build with io_uring support. (--with-liburing) */
#undef USE_LIBURING
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..7e990d9f776
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ (ro_volatile_var) = *(volatile uint64 *) (ptr)
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+ do {} while(0)
+
+#endif
+
+#endif /* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
/* Possible values for huge_pages and huge_pages_status */
typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'ICU_LIBS',
+ 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
@@ -232,6 +234,7 @@ pgxs_deps = {
'icu': icu,
'ldap': ldap,
'libcurl': libcurl,
+ 'libnuma': libnuma,
'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
+ pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 51041e75609..228888b2f66 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
+ 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+ int r = numa_available();
+
+ return r;
+}
+
+/*
+ * We use move_pages(2) syscall here - instead of get_mempolicy(2) - as the
+ * former lets us batch queries for many memory pages into a single system
+ * call, which is much faster.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return numa_max_node();
+}
+
+#else
+
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+ /* We state that NUMA is not available */
+ return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+ return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+ return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
--
2.49.0
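
A caller of pg_numa_query_pages() has to pass an array of page-aligned addresses, one per OS page it wants to ask about. Below is a minimal standalone sketch of that preparation step, assuming a power-of-two OS page size. The helper names (align_down, align_up, collect_pages) are hypothetical and not part of the patch; in-tree code would more likely use the existing TYPEALIGN_DOWN / TYPEALIGN macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of page_size (page_size: power of two). */
static uintptr_t
align_down(uintptr_t addr, size_t page_size)
{
	return addr & ~((uintptr_t) page_size - 1);
}

/* Round addr up to a multiple of page_size (page_size: power of two). */
static uintptr_t
align_up(uintptr_t addr, size_t page_size)
{
	return (addr + page_size - 1) & ~((uintptr_t) page_size - 1);
}

/*
 * Hypothetical helper: store in pages[] the start address of every OS page
 * overlapped by the byte range [start, start + len).  At most max_pages
 * entries are written; the return value is the number of pages the range
 * overlaps, so the caller can detect a too-small array.
 */
static size_t
collect_pages(uintptr_t start, size_t len, size_t page_size,
			  void **pages, size_t max_pages)
{
	uintptr_t	first;
	uintptr_t	last;
	size_t		npages;
	size_t		i;

	/* the bit tricks above only work for power-of-two page sizes */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	first = align_down(start, page_size);
	last = align_up(start + len, page_size);
	npages = (last - first) / page_size;

	for (i = 0; i < npages && i < max_pages; i++)
		pages[i] = (void *) (first + i * page_size);

	return npages;
}
```

An allocation that starts or ends half-way through a page is counted as using that whole page, so a 4kB range starting at offset 0x1800 with 4kB pages yields two entries (0x1000 and 0x2000). The resulting array is what would then be handed to a batched query such as pg_numa_query_pages().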
Attachment: v29-0002-Introduce-pg_shmem_allocations_numa-view.patch (text/x-patch; charset=UTF-8)
From b02e2f2cc0770633b565888936a4b9eb95232a19 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 7 Apr 2025 19:32:39 +0200
Subject: [PATCH v29 2/4] Introduce pg_shmem_allocations_numa view
Introduce new pg_shmem_allocations_numa view with information about how
shared memory is distributed across NUMA nodes. For each shared memory
segment, the view returns one row for each NUMA node backing it, with
the total amount of memory allocated from that node.
The view may be relatively expensive, especially when executed for the
first time in a backend, as it has to touch all memory pages to get
reliable information about the NUMA node. This may also force allocation
of the shared memory.
Unlike pg_shmem_allocations, the view does not show anonymous shared
memory allocations. It also does not show memory allocated using the
dynamic shared memory infrastructure.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
doc/src/sgml/system-views.sgml | 95 ++++++++++++++
src/backend/catalog/system_views.sql | 8 ++
src/backend/storage/ipc/shmem.c | 159 +++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 8 ++
src/test/regress/expected/numa.out | 13 ++
src/test/regress/expected/numa_1.out | 5 +
src/test/regress/expected/privileges.out | 16 ++-
src/test/regress/expected/rules.out | 4 +
src/test/regress/parallel_schedule | 2 +-
src/test/regress/sql/numa.sql | 10 ++
src/test/regress/sql/privileges.sql | 6 +-
11 files changed, 321 insertions(+), 5 deletions(-)
create mode 100644 src/test/regress/expected/numa.out
create mode 100644 src/test/regress/expected/numa_1.out
create mode 100644 src/test/regress/sql/numa.sql
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..0eba37268bf 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
<entry>shared memory allocations</entry>
</row>
+ <row>
+ <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+ <entry>NUMA node mappings for shared memory allocations</entry>
+ </row>
+
<row>
<entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
<entry>planner statistics</entry>
@@ -4051,6 +4056,96 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
</para>
</sect1>
+ <sect1 id="view-pg-shmem-allocations-numa">
+ <title><structname>pg_shmem_allocations_numa</structname></title>
+
+ <indexterm zone="view-pg-shmem-allocations-numa">
+ <primary>pg_shmem_allocations_numa</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations_numa</structname> view shows how shared
+ memory allocations in the server's main shared memory segment are distributed
+ across NUMA nodes. This includes both memory allocated by
+ <productname>PostgreSQL</productname> itself and memory allocated
+ by extensions using the mechanisms detailed in
+ <xref linkend="xfunc-shared-addin" />. This view will output multiple rows
+ for each of the shared memory segments provided that they are spread across
+ multiple NUMA nodes. This view should not be queried by monitoring systems
+ as it is very slow and may end up allocating shared memory in case it was not
+ used earlier.
+ A current limitation of this view is that it does not show anonymous shared memory
+ allocations.
+ </para>
+
+ <para>
+ Note that this view does not include memory allocated using the dynamic
+ shared memory infrastructure.
+ </para>
+
+ <warning>
+ <para>
+ When determining the <acronym>NUMA</acronym> node, the view touches
+ all memory pages for the shared memory segment. This will force
+ allocation of the shared memory, if it wasn't allocated already,
+ and the memory may get allocated in a single <acronym>NUMA</acronym>
+ node (depending on system configuration).
+ </para>
+ </warning>
+
+ <table>
+ <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>name</structfield> <type>text</type>
+ </para>
+ <para>
+ The name of the shared memory allocation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_node</structfield> <type>int4</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>size</structfield> <type>int8</type>
+ </para>
+ <para>
+ Size of the allocation on this particular NUMA memory node in bytes
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+ read only by superusers or roles with privileges of the
+ <literal>pg_read_all_stats</literal> role.
+ </para>
+ </sect1>
+
<sect1 id="view-pg-stats">
<title><structname>pg_stats</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
+CREATE VIEW pg_shmem_allocations_numa AS
+ SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
CREATE VIEW pg_backend_memory_contexts AS
SELECT * FROM pg_get_backend_memory_contexts();
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..e10b380e5c7 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "port/pg_numa.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t *ShmemLock; /* spinlock for shared memory and LWLock
static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,159 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/*
+ * SQL SRF showing NUMA memory nodes for allocated shared memory
+ *
+ * Compared to pg_get_shmem_allocations(), this function does not return
+ * information about shared anonymous allocations and unused shared memory.
+ */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ HASH_SEQ_STATUS hstat;
+ ShmemIndexEnt *ent;
+ Datum values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ bool nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+ Size os_page_size;
+ void **page_ptrs;
+ int *pages_status;
+ uint64 shm_total_page_count,
+ shm_ent_page_count,
+ max_nodes;
+ Size *nodes;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ InitMaterializedSRF(fcinfo, 0);
+
+ max_nodes = pg_numa_get_max_node();
+ nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+ /*
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+ * the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to: 1. Determine the OS memory
+ * page size 2. Calculate how many OS pages are used by all buffer blocks
+ * 3. Calculate how many OS pages are contained within each database
+ * block.
+ *
+ * This information is needed before calling move_pages() for NUMA memory
+ * node inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * Allocate memory for page pointers and status based on total shared
+ * memory size. This simplified approach allocates enough space for all
+ * pages in shared memory rather than calculating the exact requirements
+ * for each segment.
+ *
+ * Add 1, because we don't know how exactly the segments align to OS
+ * pages, so the allocation might use one more memory page. In practice
+ * this is not very likely, and moreover we have more entries, each of
+ * them using only fraction of the total pages.
+ */
+ shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
+ page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+ pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ hash_seq_init(&hstat, ShmemIndex);
+
+ /* output all allocated entries */
+ memset(nulls, 0, sizeof(nulls));
+ while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+ {
+ int i;
+ char *startptr,
+ *endptr;
+ Size total_len;
+
+ /*
+ * Calculate the range of OS pages used by this segment. The segment
+ * may start / end half-way through a page, we want to count these
+ * pages too. So we align the start/end pointers down/up, and then
+ * calculate the number of pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, ent->location);
+ endptr = (char *) TYPEALIGN(os_page_size,
+ (char *) ent->location + ent->allocated_size);
+ total_len = (endptr - startptr);
+
+ shm_ent_page_count = total_len / os_page_size;
+
+ /*
+ * If we ever get 0xff (-1) back from kernel inquiry, then we probably
+ * have a bug in mapping buffers to OS pages.
+ */
+ memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+ /*
+ * Setup page_ptrs[] with pointers to all OS pages for this segment,
+ * and get the NUMA status using pg_numa_query_pages.
+ *
+ * In order to get reliable results we also need to touch memory
+ * pages, so that inquiry about NUMA memory node doesn't return -2
+ * (ENOENT, which indicates unmapped/unallocated pages).
+ */
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ volatile uint64 touch pg_attribute_unused();
+
+ page_ptrs[i] = startptr + (i * os_page_size);
+
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+ /* Count number of NUMA nodes used for this shared memory entry */
+ memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+
+ for (i = 0; i < shm_ent_page_count; i++)
+ {
+ int s = pages_status[i];
+
+ /* Ensure we are adding only valid index to the array */
+ if (s < 0 || s > max_nodes)
+ {
+ elog(ERROR, "invalid NUMA node id outside of allowed range "
+ "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+ }
+
+ nodes[s]++;
+ }
+
+ /*
+ * Add one entry for each NUMA node, including those without allocated
+ * memory for this segment.
+ */
+ for (i = 0; i <= max_nodes; i++)
+ {
+ values[0] = CStringGetTextDatum(ent->key);
+ values[1] = Int32GetDatum(i);
+ values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+
+ LWLockRelease(ShmemIndexLock);
+ firstNumaTouch = false;
+
+ return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 04834d130f9..8597981d6b3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8546,6 +8546,14 @@
proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_numa_available' },
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+ proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+ provolatile => 'v', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+ proargnames => '{name,numa_node,size}',
+ prosrc => 'pg_get_shmem_allocations_numa' },
+
# memory context of local backend
{ oid => '2282',
descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..8af5dfeb9a5
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,13 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..c90042fa7cc
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,5 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+ERROR: libnuma initialization failed or NUMA is not supported on this platform
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 1fddb13b6ae..c25062c288f 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3219,8 +3219,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
-- clean up
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
CREATE ROLE regress_readallstats;
@@ -3242,6 +3242,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
f
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege
+---------------------
+ f
+(1 row)
+
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
has_table_privilege
@@ -3261,6 +3267,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
t
(1 row)
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege
+---------------------
+ t
+(1 row)
+
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..6cf828ca8d0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
size,
allocated_size
FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+ numa_node,
+ size
+ FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, numa_node, size);
pg_stat_activity| SELECT s.datid,
d.datname,
s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# The stats test resets stats, so nothing else needing stats access can be in
# this group.
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
# event_trigger depends on create_am and cannot run concurrently with
# any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..324481c33b7
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,10 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 85d7280f35f..f337aa67c13 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1947,8 +1947,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
DROP TABLE lock_table;
DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
-- switch to superuser
\c -
@@ -1958,12 +1958,14 @@ CREATE ROLE regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
GRANT pg_read_all_stats TO regress_readallstats;
SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
-- run query to ensure that functions within views can be executed
SET ROLE regress_readallstats;
--
2.49.0
Attachment: v29-0003-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch (text/x-patch)
From b5c6dc19a198977929b176573678a84da1c12fbd Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 7 Apr 2025 19:39:31 +0200
Subject: [PATCH v29 3/4] Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing NUMA memory nodes
for individual buffers. For each buffer the view returns an entry for
each memory page, with the associated NUMA node.
The database blocks and OS memory pages may have different sizes - the
default block size is 8KB, while the memory page is 4KB (on x86). But
other combinations are possible, depending on configure parameters,
platform, etc. This means buffers may overlap with multiple memory
pages, each associated with a different NUMA node.
To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.
The view may be relatively expensive, especially when accessed for the
first time in a backend, as it touches all memory pages to get reliable
information about the NUMA node. This may also force allocation of the
shared memory.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
contrib/pg_buffercache/Makefile | 5 +-
.../expected/pg_buffercache_numa.out | 29 ++
.../expected/pg_buffercache_numa_1.out | 3 +
contrib/pg_buffercache/meson.build | 2 +
.../pg_buffercache--1.5--1.6.sql | 22 ++
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 283 ++++++++++++++++++
.../sql/pg_buffercache_numa.sql | 21 ++
doc/src/sgml/pgbuffercache.sgml | 85 +++++-
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 450 insertions(+), 4 deletions(-)
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
- pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+ pg_buffercache--1.5--1.6.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..a10b331a552
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,29 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR: permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
'pg_buffercache--1.2.sql',
'pg_buffercache--1.3--1.4.sql',
'pg_buffercache--1.4--1.5.sql',
+ 'pg_buffercache--1.5--1.6.sql',
'pg_buffercache.control',
kwargs: contrib_data_args,
)
@@ -34,6 +35,7 @@ tests += {
'regress': {
'sql': [
'pg_buffercache',
+ 'pg_buffercache_numa',
],
},
}
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..f6668e41b37
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache_numa AS
+ SELECT P.* FROM pg_buffercache_numa_pages() AS P
+ (bufferid integer, os_page_num int4, numa_node int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..b9fdf87bcc4 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -20,6 +21,8 @@
#define NUM_BUFFERCACHE_SUMMARY_ELEM 5
#define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
+#define NUM_BUFFERCACHE_NUMA_ELEM 3
+
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
.version = PG_VERSION
@@ -58,16 +61,44 @@ typedef struct
BufferCachePagesRec *record;
} BufferCachePagesContext;
+/*
+ * Record structure holding the to be exposed cache data.
+ */
+typedef struct
+{
+ uint32 bufferid;
+ int32 page_num;
+ int32 numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+ TupleDesc tupdesc;
+ int buffers_per_page;
+ int pages_per_buffer;
+ int os_page_size;
+ BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
/*
* Function returning data from the shared buffer cache - buffer number,
* relation node/tablespace/database/blocknum and dirty indicator.
*/
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -246,6 +277,258 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * Inquire about NUMA memory mappings for shared buffers.
+ *
+ * Returns NUMA node ID for each memory page used by the buffer. Buffers may
+ * be smaller or larger than OS memory pages. For each buffer we return one
+ * entry for each memory page used by the buffer (if the buffer is smaller,
+ * it only uses a part of one memory page).
+ *
+ * We expect both sizes (for buffers and memory pages) to be a power-of-2, so
+ * one is always a multiple of the other.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ BufferCacheNumaContext *fctx; /* User function context. */
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ int i,
+ idx;
+ Size os_page_size = 0;
+ void **os_page_ptrs = NULL;
+ int *os_page_status;
+ uint64 os_page_count;
+ int pages_per_buffer;
+ int max_entries;
+ volatile uint64 touch pg_attribute_unused();
+ char *startptr,
+ *endptr;
+
+ if (pg_numa_init() == -1)
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+ /*
+ * The database block size and OS memory page size are unlikely to be
+ * the same. The block size is 1-32KB, the memory page size depends on
+ * platform. On x86 it's usually 4KB, on ARM it's 4KB or 64KB, but
+ * there are also features like THP etc. Moreover, we don't quite know
+ * how the pages and buffers "align" in memory - the buffers may be
+ * shifted in some way, using more memory pages than necessary.
+ *
+ * So we need to be careful about mapping buffers to memory pages. We
+ * calculate the maximum number of pages a buffer might use, so that
+ * we allocate enough space for the entries. And then we count the
+ * actual number of entries as we scan the buffers.
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * node id inquiry.
+ */
+ os_page_size = pg_numa_get_pagesize();
+
+ /*
+ * The page and block sizes are expected to be 2^k, so one divides the
+ * other (we don't know in which direction). This does not say
+ * anything about relative alignment of pages/buffers.
+ */
+ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+ /*
+ * How many addresses we are going to query? Simply get the page for
+ * the first buffer, and first page after the last buffer, and count
+ * the pages from that.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size,
+ BufferGetBlock(1));
+ endptr = (char *) TYPEALIGN_DOWN(os_page_size,
+ (char *) BufferGetBlock(NBuffers) + BLCKSZ);
+ os_page_count = (endptr - startptr) / os_page_size;
+
+ /* Used to determine the NUMA node for all OS pages at once */
+ os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+ os_page_status = palloc(sizeof(int) * os_page_count);
+
+ /* Fill pointers for all the memory pages. */
+ idx = 0;
+ for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
+ {
+ os_page_ptrs[idx++] = ptr;
+
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
+ pg_numa_touch_mem_if_required(touch, ptr);
+ }
+
+ Assert(idx == os_page_count);
+
+ elog(DEBUG1, "NUMA: NBuffers=%d os_page_count=" UINT64_FORMAT " "
+ "os_page_size=%zu", NBuffers, os_page_count, os_page_size);
+
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here.
+ */
+ memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+ /* Query NUMA status for all the pointers */
+ if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+ elog(ERROR, "failed NUMA pages inquiry: %m");
+
+ /* Initialize the multi-call context, load entries about buffers */
+
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ /* Create a user function context for cross-call persistence */
+ fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "os_page_num",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "numa_node",
+ INT4OID, -1, 0);
+
+ fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+ /*
+ * Each buffer needs at least one entry, but it might be offset in
+ * some way, and use one extra entry. So we allocate space for the
+ * maximum number of entries we might need, and then count the exact
+ * number as we're walking buffers. That way we can do it in one pass,
+ * without reallocating memory.
+ */
+ pages_per_buffer = Max(1, BLCKSZ / os_page_size) + 1;
+ max_entries = NBuffers * pages_per_buffer;
+
+ /* Allocate entries for BufferCacheNumaRec records. */
+ fctx->record = (BufferCacheNumaRec *)
+ MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(BufferCacheNumaRec) * max_entries);
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+ if (firstNumaTouch)
+ elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+ /*
+ * Scan through all the buffers, saving the relevant fields in the
+ * fctx->record structure.
+ *
+ * We don't hold the partition locks, so we don't get a consistent
+ * snapshot across all buffers, but we do grab the buffer header
+ * locks, so the information of each buffer is self-consistent.
+ *
+ * This loop touches and stores addresses into os_page_ptrs[] as input
+ * to one big move_pages(2) inquiry system call. Basically we ask
+ * for all memory pages for NBuffers.
+ */
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));
+ idx = 0;
+ for (i = 0; i < NBuffers; i++)
+ {
+ char *buffptr = (char *) BufferGetBlock(i + 1);
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+ uint32 bufferid;
+ int32 page_num;
+ char *startptr_buff,
+ *endptr_buff;
+
+ CHECK_FOR_INTERRUPTS();
+
+ bufHdr = GetBufferDescriptor(i);
+
+ /* Lock each buffer header before inspecting. */
+ buf_state = LockBufHdr(bufHdr);
+ bufferid = BufferDescriptorGetBuffer(bufHdr);
+ UnlockBufHdr(bufHdr, buf_state);
+
+ /* start of the first page of this buffer */
+ startptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr);
+
+ /* start of the page right after this buffer */
+ endptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr + BLCKSZ);
+
+ /* calculate ID of the first page for this buffer */
+ page_num = (startptr_buff - startptr) / os_page_size;
+
+ /* Add an entry for each OS page overlapping with this buffer. */
+ for (char *ptr = startptr_buff; ptr < endptr_buff; ptr += os_page_size)
+ {
+ fctx->record[idx].bufferid = bufferid;
+ fctx->record[idx].page_num = page_num;
+ fctx->record[idx].numa_node = os_page_status[page_num];
+
+ /* advance to the next entry/page */
+ ++idx;
+ ++page_num;
+ }
+ }
+
+ Assert((idx >= os_page_count) && (idx <= max_entries));
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = idx;
+ funcctx->user_fctx = fctx;
+
+ /* Remember this backend touched the pages */
+ firstNumaTouch = false;
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ /* Get the saved state */
+ fctx = funcctx->user_fctx;
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+ Datum values[NUM_BUFFERCACHE_NUMA_ELEM];
+ bool nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+ values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(fctx->record[i].page_num);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(fctx->record[i].numa_node);
+ nulls[2] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
+
Datum
pg_buffercache_summary(PG_FUNCTION_ARGS)
{
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..837f3d64e21
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,21 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+ from pg_settings
+ where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..b5050cd7343 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
<para>
This module provides the <function>pg_buffercache_pages()</function>
function (wrapped in the <structname>pg_buffercache</structname> view),
- the <function>pg_buffercache_summary()</function> function, the
+ <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+ <structname>pg_buffercache_numa</structname> view), the
+ <function>pg_buffercache_summary()</function> function, the
<function>pg_buffercache_usage_counts()</function> function and
the <function>pg_buffercache_evict()</function> function.
</para>
@@ -42,6 +44,15 @@
convenient use.
</para>
+ <para>
+ The <function>pg_buffercache_numa_pages()</function> function provides
+ <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+ information is not part of <function>pg_buffercache_pages()</function>
+ itself, as it is much slower to retrieve.
+ The <structname>pg_buffercache_numa</structname> view wraps the function for
+ convenient use.
+ </para>
+
<para>
The <function>pg_buffercache_summary()</function> function returns a single
row summarizing the state of the shared buffer cache.
@@ -200,6 +211,78 @@
</para>
</sect2>
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+ <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+ <para>
+ The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+ </para>
+
+ <table id="pgbuffercache-numa-columns">
+ <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bufferid</structfield> <type>integer</type>
+ </para>
+ <para>
+ ID, in the range 1..<varname>shared_buffers</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>os_page_num</structfield> <type>int</type>
+ </para>
+ <para>
+ number of OS memory page for this buffer
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_node</structfield> <type>int</type>
+ </para>
+ <para>
+ ID of <acronym>NUMA</acronym> node
+ </para></entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+ to be paged-in, the first execution of this function can take a noticeable
+ amount of time. In all cases (first execution or not), retrieving this
+ information is costly and querying the view at a high frequency is not recommended.
+ </para>
+
+ <warning>
+ <para>
+ When determining the <acronym>NUMA</acronym> node, the view touches
+ all memory pages for the shared memory segment. This will force
+ allocation of the shared memory, if it wasn't allocated already,
+ and the memory may get allocated in a single <acronym>NUMA</acronym>
+ node (depending on system configuration).
+ </para>
+ </warning>
+
+ </sect2>
+
<sect2 id="pgbuffercache-summary">
<title>The <function>pg_buffercache_summary()</function> Function</title>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 229fbff47ae..714cee6d6f1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -340,6 +340,8 @@ BufFile
Buffer
BufferAccessStrategy
BufferAccessStrategyType
+BufferCacheNumaRec
+BufferCacheNumaContext
BufferCachePagesContext
BufferCachePagesRec
BufferDesc
--
2.49.0
Attachment: v29-0004-fixup.patch (text/x-patch)
From 8b0438045bbff5b665e65c4d1cf73b5da7d0d955 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 7 Apr 2025 21:33:20 +0200
Subject: [PATCH v29 4/4] fixup
---
contrib/pg_buffercache/pg_buffercache_pages.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index b9fdf87bcc4..a702a47efe9 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -307,8 +307,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
{
int i,
idx;
- Size os_page_size = 0;
- void **os_page_ptrs = NULL;
+ Size os_page_size;
+ void **os_page_ptrs;
int *os_page_status;
uint64 os_page_count;
int pages_per_buffer;
@@ -352,8 +352,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
*/
startptr = (char *) TYPEALIGN_DOWN(os_page_size,
BufferGetBlock(1));
- endptr = (char *) TYPEALIGN_DOWN(os_page_size,
- (char *) BufferGetBlock(NBuffers) + BLCKSZ);
+ endptr = (char *) TYPEALIGN(os_page_size,
+ (char *) BufferGetBlock(NBuffers) + BLCKSZ);
os_page_count = (endptr - startptr) / os_page_size;
/* Used to determine the NUMA node for all OS pages at once */
@@ -470,8 +470,10 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/* start of the first page of this buffer */
startptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr);
- /* start of the page right after this buffer */
- endptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr + BLCKSZ);
+ /* end of the buffer (no need to align to memory page) */
+ endptr_buff = buffptr + BLCKSZ;
+
+ Assert(startptr_buff < endptr_buff);
/* calculate ID of the first page for this buffer */
page_num = (startptr_buff - startptr) / os_page_size;
--
2.49.0
On Mon, Apr 7, 2025 at 9:51 PM Tomas Vondra <tomas@vondra.me> wrote:
So it looks like that the new way to iterate on the buffers that has been introduced
in v26/v27 has some issue?
Yeah, the calculations of the end pointers were wrong - we need to round
up (using TYPEALIGN()) when calculating number of pages, and just add
BLCKSZ (without any rounding) when calculating end of buffer. The 0004
fixes this for me (I tried this with various blocksizes / page sizes).
Thanks for noticing this!
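The difference between rounding the end pointer down and rounding it up can be sketched in standalone C. This is only an illustration under stated assumptions, not the patch code: align_down(), align_up() and the two os_page_count_*() functions are hypothetical helpers that mimic what TYPEALIGN_DOWN() and TYPEALIGN() do for power-of-two sizes, and addresses are treated as plain 64-bit integers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for TYPEALIGN_DOWN() and TYPEALIGN(); they are only
 * valid when size is a power of two.
 */
uint64_t
align_down(uint64_t size, uint64_t addr)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return addr & ~(size - 1);
}

uint64_t
align_up(uint64_t size, uint64_t addr)
{
    return align_down(size, addr + size - 1);
}

/* End pointer rounded down (before the fixup): a trailing partial page is lost. */
uint64_t
os_page_count_rounded_down(uint64_t first_buf, uint64_t nbuffers,
                           uint64_t blcksz, uint64_t os_page_size)
{
    uint64_t    startptr = align_down(os_page_size, first_buf);
    uint64_t    endptr = align_down(os_page_size, first_buf + nbuffers * blcksz);

    return (endptr - startptr) / os_page_size;
}

/* End pointer rounded up (after the fixup): every touched page is counted. */
uint64_t
os_page_count_rounded_up(uint64_t first_buf, uint64_t nbuffers,
                         uint64_t blcksz, uint64_t os_page_size)
{
    uint64_t    startptr = align_down(os_page_size, first_buf);
    uint64_t    endptr = align_up(os_page_size, first_buf + nbuffers * blcksz);

    return (endptr - startptr) / os_page_size;
}
```

For a layout with NBuffers=16384, 8kB blocks, 2MB pages and the first buffer at 0x7f8661079000, the rounded-up variant yields 65 pages and the rounded-down variant only 64. When the first buffer happens to start on a page boundary, both variants agree.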
Hi,
v28-0001 LGTM
v28-0002 got this warning Andres was talking about, so LGTM
v28-0003 (pg_buffercache_numa now), LGTM, but I *thought* for quite
some time we have 2nd bug there, but it appears that PG never properly
aligned whole s_b to os_page_size(HP)? ... Thus we cannot assume
count(*) pg_buffercache_numa == count(*) pg_buffercache.
So before anybody else reports this as bug about duplicate bufferids:
# select * from pg_buffercache_numa where os_page_num <= 2;
bufferid | os_page_num | numa_node
----------+-------------+-----------
[..]
195 | 0 | 0
196 | 0 | 0 <-- duplicate?
196 | 1 | 0 <-- duplicate?
197 | 1 | 0
198 | 1 | 0
That is strange because on first look one could assume we get 257x
8192 blocks per os_page (2^21) that way, which is impossible.
Exercises in pointers show this:
# select * from pg_buffercache_numa where os_page_num <= 2;
DEBUG: NUMA: NBuffers=16384 os_page_count=65 os_page_size=2097152
DEBUG: NUMA: page-faulting the buffercache for proper NUMA readouts
-- custom elog(DEBUG1)
DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f866107b000 bufferid=1 page_num=0 real
buffptr=0x7f8661079000
[..]
DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f86611fd000 bufferid=194 page_num=0 real
buffptr=0x7f86611fb000
DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f86611ff000 bufferid=195 page_num=0 real
buffptr=0x7f86611fd000
DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f8661201000 bufferid=196 page_num=0 real
buffptr=0x7f86611ff000 (!)
DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661000000
endptr_buff=0x7f8661201000 bufferid=196 page_num=1 real
buffptr=0x7f86611ff000 (!)
DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661200000
endptr_buff=0x7f8661203000 bufferid=197 page_num=1 real
buffptr=0x7f8661201000
DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661200000
endptr_buff=0x7f8661205000 bufferid=198 page_num=1 real
buffptr=0x7f8661203000
So we have bufferid=196 with buffptr=0x7f86611ff000 that is 8kB big
(and ends at 0x7f8661201000), while the HP that hosts its start lies
between 0x7f8661000000 and 0x7f8661200000. So buffer 196 spans 2
huge pages. An open question for another day (outside of this thread,
of course) is whether we should align s_b to the HP size. As per above,
even bufferid=1 is at 0x7f8661079000 while the page starts at
0x7f8661000000 (that's a 495616-byte difference).
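To double-check the pointer arithmetic above, here is a minimal standalone sketch (not part of the patch). It assumes buffers are laid out contiguously, BLCKSZ bytes apart, starting at the buffptr printed for bufferid=1, and that huge pages start at the printed startptr_buff. All sketch_* names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the DEBUG output above. */
#define SKETCH_BLCKSZ        UINT64_C(8192)
#define SKETCH_HP_SIZE       UINT64_C(2097152)          /* os_page_size */
#define SKETCH_REGION_START  UINT64_C(0x7f8661000000)   /* start of huge page 0 */
#define SKETCH_FIRST_BUFFER  UINT64_C(0x7f8661079000)   /* buffptr of bufferid=1 */

/* Address of a buffer, given its 1-based bufferid (contiguous layout assumed). */
static uint64_t
sketch_buffer_addr(uint64_t bufferid)
{
    assert(bufferid >= 1);
    return SKETCH_FIRST_BUFFER + (bufferid - 1) * SKETCH_BLCKSZ;
}

/* OS page number that holds the first byte of the buffer. */
static uint64_t
sketch_first_page(uint64_t bufferid)
{
    return (sketch_buffer_addr(bufferid) - SKETCH_REGION_START) / SKETCH_HP_SIZE;
}

/* OS page number that holds the last byte of the buffer. */
static uint64_t
sketch_last_page(uint64_t bufferid)
{
    uint64_t last_byte = sketch_buffer_addr(bufferid) + SKETCH_BLCKSZ - 1;

    return (last_byte - SKETCH_REGION_START) / SKETCH_HP_SIZE;
}
```

With these inputs, bufferid 195 stays on page 0, bufferid 196 starts on page 0 and ends on page 1, and bufferid 197 is entirely on page 1, which matches the two rows reported for bufferid 196.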
-J.
Hi,
I've pushed all three parts of v29, with some additional corrections
(picked lower OIDs, bumped catversion, fixed commit messages).
On 4/7/25 23:01, Jakub Wartak wrote:
On Mon, Apr 7, 2025 at 9:51 PM Tomas Vondra <tomas@vondra.me> wrote:
So it looks like that the new way to iterate on the buffers that has been introduced
in v26/v27 has some issue?
Yeah, the calculations of the end pointers were wrong - we need to round
up (using TYPEALIGN()) when calculating number of pages, and just add
BLCKSZ (without any rounding) when calculating end of buffer. The 0004
fixes this for me (I tried this with various blocksizes / page sizes).
Thanks for noticing this!
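The rounding rule described above can be sketched as follows. This is an illustration only, not the code from 0004: the SKETCH_ALIGN_* macros are local stand-ins for TYPEALIGN()/TYPEALIGN_DOWN(), rounding the start down is an assumption of this sketch, and the input values are the ones from the DEBUG output quoted in this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Local stand-ins for round-up / round-down alignment helpers.
 * The alignment value must be a power of two.
 */
#define SKETCH_ALIGN_UP(a, x)   (((uint64_t) (x) + ((a) - 1)) & ~((uint64_t) (a) - 1))
#define SKETCH_ALIGN_DOWN(a, x) ((uint64_t) (x) & ~((uint64_t) (a) - 1))

/* Number of OS pages touched by nbuffers contiguous buffers. */
static uint64_t
sketch_os_page_count(uint64_t first_buffer, uint64_t nbuffers,
                     uint64_t blcksz, uint64_t os_page_size)
{
    uint64_t end = first_buffer + nbuffers * blcksz;
    uint64_t startptr;
    uint64_t endptr;

    assert(os_page_size != 0 && (os_page_size & (os_page_size - 1)) == 0);

    startptr = SKETCH_ALIGN_DOWN(os_page_size, first_buffer);  /* assumption */
    endptr = SKETCH_ALIGN_UP(os_page_size, end);               /* round up */
    return (endptr - startptr) / os_page_size;
}

/* End of a single buffer: plain start + BLCKSZ, without any rounding. */
static uint64_t
sketch_buffer_end(uint64_t buffer_start, uint64_t blcksz)
{
    return buffer_start + blcksz;
}
```

For NBuffers=16384, BLCKSZ=8192 and 2MB pages with the first buffer at 0x7f8661079000 this gives 65 pages, the same os_page_count as in the DEBUG line; the extra page over 128MB/2MB=64 comes from the unaligned start.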
Hi,
v28-0001 LGTM
v28-0002 got this warning Andres was talking about, so LGTM
v28-0003 (pg_buffercache_numa now), LGTM, but I *thought* for quite
some time we have 2nd bug there, but it appears that PG never properly
aligned whole s_b to os_page_size(HP)? ... Thus we cannot assume
count(*) pg_buffercache_numa == count(*) pg_buffercache.
AFAIK v29 fixed this, the end pointer calculations were wrong. With that
it passed for me with/without THP, different blocks sizes etc.
We don't align buffers to os_page_size, we align them PG_IO_ALIGN_SIZE,
which is 4kB or so. And it's determined at compile time, while THP is
determined when starting the cluster.
So before anybody else reports this as bug about duplicate bufferids:
# select * from pg_buffercache_numa where os_page_num <= 2;
bufferid | os_page_num | numa_node
----------+-------------+-----------
[..]
195 | 0 | 0
196 | 0 | 0 <-- duplicate?
196 | 1 | 0 <-- duplicate?
197 | 1 | 0
198 | 1 | 0
That is strange because on first look one could assume we get 257x
8192 blocks per os_page (2^21) that way, which is impossible.
Exercises in pointers show this:
# select * from pg_buffercache_numa where os_page_num <= 2;
DEBUG: NUMA: NBuffers=16384 os_page_count=65 os_page_size=2097152
DEBUG: NUMA: page-faulting the buffercache for proper NUMA readouts
-- custom elog(DEBUG1)
DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f866107b000 bufferid=1 page_num=0 real
buffptr=0x7f8661079000
[..]
DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f86611fd000 bufferid=194 page_num=0 real
buffptr=0x7f86611fb000
DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f86611ff000 bufferid=195 page_num=0 real
buffptr=0x7f86611fd000
DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f8661201000 bufferid=196 page_num=0 real
buffptr=0x7f86611ff000 (!)
DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661000000
endptr_buff=0x7f8661201000 bufferid=196 page_num=1 real
buffptr=0x7f86611ff000 (!)
DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661200000
endptr_buff=0x7f8661203000 bufferid=197 page_num=1 real
buffptr=0x7f8661201000
DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661200000
endptr_buff=0x7f8661205000 bufferid=198 page_num=1 real
buffptr=0x7f8661203000
so we have NBuffer=196 with bufferptr=0x7f86611ff000 that is 8kB big
(and ends up at 0x7f8661201000), while we also have HP that hosts it
between 0x7f8661000000 and 0x7f8661200000. So Buffer 196 spans 2
hugepages. Open question for another day is shouldn't (of course
outside of this $thread) align s_b to HP size or not? As per above
even bufferid=1 has 0x7f8661079000 while page starts on 0x7f8661000000
(that's 495616 bytes difference).
Right, this is because that's where the THP boundary happens to be. And
that one "duplicate" entry is for a buffer that happens to span two
pages. This is *exactly* the misalignment of blocks and pages that I was
wondering about earlier, and with the fixed endptr calculation we handle
that just fine.
No opinion on the alignment - maybe we should do that, but it's not
something this patch needs to worry about.
regards
--
Tomas Vondra
On Mon, Apr 7, 2025 at 11:27 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi,
I've pushed all three parts of v29, with some additional corrections
(picked lower OIDs, bumped catversion, fixed commit messages).
Hi Tomas, great, awesome! (this is an awesome feeling)! Thank You for
such incredible support on the last mile of this and also to Bertrand
(for persistence!), Andres and Alvaro for lots of babysitting.
AFAIK v29 fixed this, the end pointer calculations were wrong. With that
it passed for me with/without THP, different blocks sizes etc.
Yeah, that was a typo, I've started writing about v28, but then in the
middle of that v29 landed and I still was chasing that finding, I've
just forgotten to bump this.
We don't align buffers to os_page_size, we align them PG_IO_ALIGN_SIZE,
which is 4kB or so. And it's determined at compile time, while THP is
determined when starting the cluster.
[..]
Right, this is because that's where the THP boundary happens to be. And
that one "duplicate" entry is for a buffer that happens to span two
pages. This is *exactly* the misalignment of blocks and pages that I was
wondering about earlier, and with the fixed endptr calculation we handle
that just fine.
No opinion on the alignment - maybe we should do that, but it's not
something this patch needs to worry about.
Agreed. I was wondering even if there are other drawbacks of the
situation, but other than not reporting duplicates here in this
pg_buffercache view, I cannot identify anything worthwhile.
-J.
On 4/7/25 23:50, Jakub Wartak wrote:
On Mon, Apr 7, 2025 at 11:27 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi,
I've pushed all three parts of v29, with some additional corrections
(picked lower OIDs, bumped catversion, fixed commit messages).
Hi Tomas, great, awesome! (this is an awesome feeling)! Thank You for
such incredible support on the last mile of this and also to Bertrand
(for persistence!), Andres and Alvaro for lots of babysitting.
Glad I could help, thanks for the patch.
AFAIK v29 fixed this, the end pointer calculations were wrong. With that
it passed for me with/without THP, different blocks sizes etc.
Yeah, that was a typo, I've started writing about v28, but then in the
middle of that v29 landed and I still was chasing that finding, I've
just forgotten to bump this.
We don't align buffers to os_page_size, we align them PG_IO_ALIGN_SIZE,
which is 4kB or so. And it's determined at compile time, while THP is
determined when starting the cluster.
[..]
Right, this is because that's where the THP boundary happens to be. And
that one "duplicate" entry is for a buffer that happens to span two
pages. This is *exactly* the misalignment of blocks and pages that I was
wondering about earlier, and with the fixed endptr calculation we handle
that just fine.
No opinion on the alignment - maybe we should do that, but it's not
something this patch needs to worry about.
Agreed. I was wondering even if there are other drawbacks of the
situation, but other than not reporting duplicates here in this
pg_buffercache view, I cannot identify anything worthwhile.
Well, the drawback is that accessing the buffer may require hitting two
different NUMA nodes. I'm not 100% sure it can actually happen, though.
The buffer should be initialized as a whole, so it should go to the
same node. But maybe it could be "split" by THP migration, or something
like that.
In any case, that's not caused by this patch, and it's less serious with
huge pages - it only affects buffers on the boundaries. But with the
small 4K pages it can happen for *every* buffer.
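A rough way to see the difference between the two page sizes is to count how many buffers cross an OS page boundary. The sketch below is purely illustrative: it only looks at addresses (it says nothing about which NUMA node a page actually lands on), and it reuses the buffer start address from the DEBUG output earlier in the thread. sketch_count_spanning() is a hypothetical helper, not something in PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count buffers whose first and last byte fall on different OS pages.
 * Buffers are assumed to be contiguous, blcksz bytes apart, and OS pages
 * are assumed to start at multiples of os_page_size.
 */
static uint64_t
sketch_count_spanning(uint64_t first_buffer, uint64_t nbuffers,
                      uint64_t blcksz, uint64_t os_page_size)
{
    uint64_t spanning = 0;

    assert(blcksz > 0 && os_page_size > 0);

    for (uint64_t i = 0; i < nbuffers; i++)
    {
        uint64_t start = first_buffer + i * blcksz;
        uint64_t last_byte = start + blcksz - 1;

        if (start / os_page_size != last_byte / os_page_size)
            spanning++;
    }
    return spanning;
}
```

With the layout from the DEBUG output (16384 buffers of 8kB starting at 0x7f8661079000), 64 buffers cross a 2MB boundary, while with 4kB pages every 8kB buffer covers two pages.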
regards
--
Tomas Vondra
Hi,
Thanks for developing this great feature.
The manual says that the 'size' column of the pg_shmem_allocations_numa view is 'int4', but the implementation is 'int8'.
The attached small patch fixes the manual.
Regards,
Noriyoshi Shinoda
Attachments:
pg_shmem_allocations_numa_doc_v1.diff (application/octet-stream)
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 0eba37268bf..737e7489b78 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -4128,7 +4128,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>size</structfield> <type>int4</type>
+ <structfield>size</structfield> <type>int8</type>
</para>
<para>
Size of the allocation on this particular NUMA memory node in bytes
On 4/8/25 01:26, Shinoda, Noriyoshi (SXD Japan FSI) wrote:
Hi,
Thanks for developing this great feature.
The manual says that the 'size' column of the pg_shmem_allocations_numa view is 'int4', but the implementation is 'int8'.
The attached small patch fixes the manual.
Thank you for noticing this and for the fix! Pushed.
This also reminded me we agreed to change page_num to bigint, which I
forgot to change before commit. So I adjusted that too, separately.
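For context on why the wider types matter, a small sketch (not from the patch) assuming int4 is a signed 32-bit and int8 a signed 64-bit integer: a byte count stops fitting in int4 above 2GB, and a page number with 4kB pages stops fitting once the mapped region exceeds 8TB. The sketch_fits_* helpers are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Does the value fit into a signed 32-bit column (int4)? */
static bool
sketch_fits_int4(uint64_t value)
{
    return value <= (uint64_t) INT32_MAX;
}

/* Does the value fit into a signed 64-bit column (int8 / bigint)? */
static bool
sketch_fits_int8(uint64_t value)
{
    return value <= (uint64_t) INT64_MAX;
}
```

For example, a 4GB allocation size (UINT64_C(4) << 30 bytes) does not pass sketch_fits_int4() but passes sketch_fits_int8().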
regards
--
Tomas Vondra
On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
I'll let the CI run the tests on it, and
then will push, unless someone has more comments.
Hi! I noticed strange failure after this commit[0]
Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
In file included from ../pgsql/src/include/postgres.h:49,
from ../pgsql/src/port/pg_numa.c:16:
../pgsql/src/include/utils/elog.h:79:10: fatal error: utils/errcodes.h: No such file or directory
79 | #include "utils/errcodes.h"
| ^~~~~~~~~~~~~~~~~~
compilation terminated.
[0]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dogfish&dt=2025-04-08%2011%3A25%3A11
--
Best regards,
Kirill Reshke
Hi,
On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
I'll let the CI run the tests on it, and
then will push, unless someone has more comments.
Hi! I noticed strange failure after this commit[0]
Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
In file included from ../pgsql/src/include/postgres.h:49,
from ../pgsql/src/port/pg_numa.c:16:
../pgsql/src/include/utils/elog.h:79:10: fatal error: utils/errcodes.h: No such file or directory
79 | #include "utils/errcodes.h"
| ^~~~~~~~~~~~~~~~~~
compilation terminated.
$ ninja -t missingdeps
Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Processed 2384 nodes.
Error: There are 2 missing dependency paths.
2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator.
There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory.
I think it's not right that something in src/port defines an SQL callable
function. The set of headers for that pull in a lot of things.
Since the file relies on things like GUCs, I suspect this should be in
src/backend/port or such instead.
Greetings,
Andres Freund
On 4/8/25 15:06, Andres Freund wrote:
Hi,
On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
I'll let the CI run the tests on it, and
then will push, unless someone has more comments.
Hi! I noticed strange failure after this commit[0]
Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
In file included from ../pgsql/src/include/postgres.h:49,
from ../pgsql/src/port/pg_numa.c:16:
../pgsql/src/include/utils/elog.h:79:10: fatal error: utils/errcodes.h: No such file or directory
79 | #include "utils/errcodes.h"
| ^~~~~~~~~~~~~~~~~~
compilation terminated.
$ ninja -t missingdeps
Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Processed 2384 nodes.
Error: There are 2 missing dependency paths.
2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator.
There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory.
I think it's not right that something in src/port defines an SQL callable
function. The set of headers for that pull in a lot of things.
Since the file relies on things like GUCs, I suspect this should be in
src/backend/port or such instead.
Yeah, I think you're right, src/backend/port seems like a better place
for this. I'll look into moving that in the evening.
regards
--
Tomas Vondra
Hi,
On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote:
On 4/8/25 15:06, Andres Freund wrote:
Hi,
On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
I'll let the CI run the tests on it, and
then will push, unless someone has more comments.
Hi! I noticed strange failure after this commit[0]
Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
In file included from ../pgsql/src/include/postgres.h:49,
from ../pgsql/src/port/pg_numa.c:16:
../pgsql/src/include/utils/elog.h:79:10: fatal error: utils/errcodes.h: No such file or directory
79 | #include "utils/errcodes.h"
| ^~~~~~~~~~~~~~~~~~
compilation terminated.
$ ninja -t missingdeps
Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Processed 2384 nodes.
Error: There are 2 missing dependency paths.
2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator.
There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory.
I think it's not right that something in src/port defines an SQL callable
function. The set of headers for that pull in a lot of things.
Since the file relies on things like GUCs, I suspect this should be in
src/backend/port or such instead.
Yeah, I think you're right, src/backend/port seems like a better place
for this. I'll look into moving that in the evening.
On a second look I wonder if just the SQL function and perhaps the page size function should be moved. There are FE programs that could potentially benefit from NUMA awareness (e.g. pgbench).
Andres
Hi,
On 2025-04-08 09:35:37 -0400, Andres Freund wrote:
On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote:
On 4/8/25 15:06, Andres Freund wrote:
On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
I think it's not right that something in src/port defines an SQL callable
function. The set of headers for that pull in a lot of things.
Since the file relies on things like GUCs, I suspect this should be in
src/backend/port or such instead.
Yeah, I think you're right, src/backend/port seems like a better place
for this. I'll look into moving that in the evening.
On a second look I wonder if just the SQL function and perhaps the page size
function should be moved. There are FE programs that could potentially
benefit from NUMA awareness (e.g. pgbench).
I would move pg_numa_available() to something like
src/backend/storage/ipc/shmem.c.
I wonder if pg_numa_get_pagesize() should lose the _numa_ in the name, it's
not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in
with shmem.c or such.
Or we could just make it work in FE code by making this part
Assert(IsUnderPostmaster);
Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
if (huge_pages_status == HUGE_PAGES_ON)
GetHugePageSize(&os_page_size, NULL);
#ifndef FRONTEND - we don't currently support using huge pages in FE programs
after all. But querying the page size might still be useful.
Regardless of all of that, I don't think the include of fmgr.h in pg_numa.h is
needed?
Greetings,
Andres Freund
On 4/8/25 16:59, Andres Freund wrote:
Hi,
On 2025-04-08 09:35:37 -0400, Andres Freund wrote:
On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote:
On 4/8/25 15:06, Andres Freund wrote:
On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
I think it's not right that something in src/port defines an SQL callable
function. The set of headers for that pull in a lot of things.
Since the file relies on things like GUCs, I suspect this should be in
src/backend/port or such instead.
Yeah, I think you're right, src/backend/port seems like a better place
for this. I'll look into moving that in the evening.
On a second look I wonder if just the SQL function and perhaps the page size
function should be moved. There are FE programs that could potentially
benefit from NUMA awareness (e.g. pgbench).
I would move pg_numa_available() to something like
src/backend/storage/ipc/shmem.c.
Makes sense, done in the attached patch.
I wonder if pg_numa_get_pagesize() should lose the _numa_ in the name, it's
not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in
with shmem.c or such.
True. It's not really a "NUMA page" size, but the page size for the shmem
segment. So renamed to pg_get_shmem_pagesize() and moved to shmem.c,
same as pg_numa_available().
The rest of pg_numa.c is moved to src/backend/port.
Or we could just make it work in FE code by making this part
Assert(IsUnderPostmaster);
Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
if (huge_pages_status == HUGE_PAGES_ON)
GetHugePageSize(&os_page_size, NULL);
#ifndef FRONTEND - we don't currently support using huge pages in FE programs
after all. But querying the page size might still be useful.
I don't really like this. Why shouldn't the FE program simply call
sysconf(_SC_PAGESIZE)? It'd be just confusing if in backend it'd also
verify huge page status.
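For illustration, the frontend-side alternative could look like the sketch below. It is only a sketch for POSIX systems (the backend function in the attached patch uses GetSystemInfo() on Windows); sketch_fe_pagesize() is a hypothetical name, and it deliberately knows nothing about huge pages.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Hypothetical frontend helper: report the base OS page size only.
 * No huge page handling, since FE programs do not use huge pages.
 */
static long
sketch_fe_pagesize(void)
{
    long sz = sysconf(_SC_PAGESIZE);

    /* sysconf() returns -1 on error; a valid page size is positive */
    assert(sz > 0);
    return sz;
}
```

This keeps the huge page checks (and their Asserts) in the backend-only function, which is the separation argued for above.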
Regardless of all of that, I don't think the include of fmgr.h in pg_numa.h is
needed?
Right, that was left there after moving the prototype into the .c file.
regards
--
Tomas Vondra
Attachments:
numa-fixes.patch (text/x-patch; charset=UTF-8)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index c9ceba604b1..e1701bd56ef 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -343,7 +343,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
* This information is needed before calling move_pages() for NUMA
* node id inquiry.
*/
- os_page_size = pg_numa_get_pagesize();
+ os_page_size = pg_get_shmem_pagesize();
/*
* The pages and block size is expected to be 2^k, so one divides the
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index 47338d99229..5dafbf7c0c0 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -24,6 +24,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
$(TAS) \
atomics.o \
+ pg_numa.o \
pg_sema.o \
pg_shmem.o
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 09d54e01d13..a9f7120aef4 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -2,6 +2,7 @@
backend_sources += files(
'atomics.c',
+ 'pg_numa.c',
)
diff --git a/src/port/pg_numa.c b/src/backend/port/pg_numa.c
similarity index 71%
rename from src/port/pg_numa.c
rename to src/backend/port/pg_numa.c
index 5e2523cf798..20be13f669d 100644
--- a/src/port/pg_numa.c
+++ b/src/backend/port/pg_numa.c
@@ -20,7 +20,6 @@
#include <windows.h>
#endif
-#include "fmgr.h"
#include "miscadmin.h"
#include "port/pg_numa.h"
#include "storage/pg_shmem.h"
@@ -36,8 +35,6 @@
#include <numa.h>
#include <numaif.h>
-Datum pg_numa_available(PG_FUNCTION_ARGS);
-
/* libnuma requires initialization as per numa(3) on Linux */
int
pg_numa_init(void)
@@ -66,8 +63,6 @@ pg_numa_get_max_node(void)
#else
-Datum pg_numa_available(PG_FUNCTION_ARGS);
-
/* Empty wrappers */
int
pg_numa_init(void)
@@ -89,32 +84,3 @@ pg_numa_get_max_node(void)
}
#endif
-
-Datum
-pg_numa_available(PG_FUNCTION_ARGS)
-{
- PG_RETURN_BOOL(pg_numa_init() != -1);
-}
-
-/* This should be used only after the server is started */
-Size
-pg_numa_get_pagesize(void)
-{
- Size os_page_size;
-#ifdef WIN32
- SYSTEM_INFO sysinfo;
-
- GetSystemInfo(&sysinfo);
- os_page_size = sysinfo.dwPageSize;
-#else
- os_page_size = sysconf(_SC_PAGESIZE);
-#endif
-
- Assert(IsUnderPostmaster);
- Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
-
- if (huge_pages_status == HUGE_PAGES_ON)
- GetHugePageSize(&os_page_size, NULL);
-
- return os_page_size;
-}
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index e10b380e5c7..0903eb50f54 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -93,6 +93,8 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
/* To get reliable results for NUMA inquiry we need to "touch pages" once */
static bool firstNumaTouch = true;
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
*/
@@ -615,7 +617,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
* This information is needed before calling move_pages() for NUMA memory
* node inquiry.
*/
- os_page_size = pg_numa_get_pagesize();
+ os_page_size = pg_get_shmem_pagesize();
/*
* Allocate memory for page pointers and status based on total shared
@@ -727,3 +729,32 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* This should be used only after the server is started */
+Size
+pg_get_shmem_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 7e990d9f776..40f1d324dcf 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -14,12 +14,9 @@
#ifndef PG_NUMA_H
#define PG_NUMA_H
-#include "fmgr.h"
-
extern PGDLLIMPORT int pg_numa_init(void);
extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
extern PGDLLIMPORT int pg_numa_get_max_node(void);
-extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
#ifdef USE_LIBNUMA
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 904a336b851..c1f668ded95 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -41,6 +41,8 @@ extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
extern Size add_size(Size s1, Size s2);
extern Size mul_size(Size s1, Size s2);
+extern PGDLLIMPORT Size pg_get_shmem_pagesize(void);
+
/* ipci.c */
extern void RequestAddinShmemSpace(Size size);
diff --git a/src/port/Makefile b/src/port/Makefile
index 4274949dfa4..f11896440d5 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,7 +45,6 @@ OBJS = \
path.o \
pg_bitutils.o \
pg_localeconv_r.o \
- pg_numa.o \
pg_popcount_aarch64.o \
pg_popcount_avx512.o \
pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index fc7b059fee5..48d2dfb7cf3 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,7 +8,6 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_localeconv_r.c',
- 'pg_numa.c',
'pg_popcount_aarch64.c',
'pg_popcount_avx512.c',
'pg_strong_random.c',
On 4/8/25 15:06, Andres Freund wrote:
Hi,
On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
I'll let the CI run the tests on it, and
then will push, unless someone has more comments.
Hi! I noticed strange failure after this commit[0]
Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
In file included from ../pgsql/src/include/postgres.h:49,
from ../pgsql/src/port/pg_numa.c:16:
../pgsql/src/include/utils/elog.h:79:10: fatal error: utils/errcodes.h: No such file or directory
79 | #include "utils/errcodes.h"
| ^~~~~~~~~~~~~~~~~~
compilation terminated.
$ ninja -t missingdeps
Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Processed 2384 nodes.
Error: There are 2 missing dependency paths.
2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator.
There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory.
Wouldn't it be good to add this (ninja -t missingdeps) to the CI task? I
ran those tests many times, and had it failed at least once I'd have
fixed it before commit.
regards
--
Tomas Vondra
Hi,
On 2025-04-09 00:47:59 +0200, Tomas Vondra wrote:
On 4/8/25 16:59, Andres Freund wrote:
Hi,
On 2025-04-08 09:35:37 -0400, Andres Freund wrote:
On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote:
On 4/8/25 15:06, Andres Freund wrote:
On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
I think it's not right that something in src/port defines an SQL callable
function. The set of headers for that pull in a lot of things.
Since the file relies on things like GUCs, I suspect this should be in
src/backend/port or such instead.
Yeah, I think you're right, src/backend/port seems like a better place
for this. I'll look into moving that in the evening.
On a second look I wonder if just the SQL function and perhaps the page size
function should be moved. There are FE programs that could potentially
benefit from NUMA awareness (e.g. pgbench).
I would move pg_numa_available() to something like
src/backend/storage/ipc/shmem.c.
Makes sense, done in the attached patch.
I wonder if pg_numa_get_pagesize() should lose the _numa_ in the name, it's
not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in
with shmem.c or such.
True. It's not really a "NUMA page" size, but the page size for the shmem
segment. So renamed to pg_get_shmem_pagesize() and moved to shmem.c,
same as pg_numa_available().
Cool.
The rest of pg_numa.c is moved to src/backend/port
Why move the remainder? It shouldn't have any dependency on postgres.h
afterwards? And I think it's not backend-specific.
Greetings,
Andres Freund
Hi,
On 2025-04-09 01:10:09 +0200, Tomas Vondra wrote:
On 4/8/25 15:06, Andres Freund wrote:
Hi,
On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
I'll let the CI run the tests on it, and
then will push, unless someone has more comments.
Hi! I noticed strange failure after this commit[0]
Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
In file included from ../pgsql/src/include/postgres.h:49,
from ../pgsql/src/port/pg_numa.c:16:
../pgsql/src/include/utils/elog.h:79:10: fatal error: utils/errcodes.h: No such file or directory
79 | #include "utils/errcodes.h"
| ^~~~~~~~~~~~~~~~~~
compilation terminated.
$ ninja -t missingdeps
Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Processed 2384 nodes.
Error: There are 2 missing dependency paths.
2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator.
There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory.
Wouldn't it be good to add this (ninja -t missingdeps) to the CI task? I
ran those tests many times, and had it failed at least once I'd have
fixed it before commit.
Yes, we should. It's a somewhat newer feature, so we originally couldn't.
There only was a very clunky and slow python script when I was doing most of
the meson work.
I was actually thinking that it might make sense as a meson-registered test,
that way one quickly can find the issue both locally and on CI.
Greetings,
Andres Freund
On Wed, Apr 9, 2025 at 12:48 AM Tomas Vondra <tomas@vondra.me> wrote:
On 4/8/25 16:59, Andres Freund wrote:
Hi,
On 2025-04-08 09:35:37 -0400, Andres Freund wrote:
On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote:
On 4/8/25 15:06, Andres Freund wrote:
On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
I think it's not right that something in src/port defines an SQL callable
function. The set of headers for that pulls in a lot of things.
Since the file relies on things like GUCs, I suspect this should be in
src/backend/port or such instead.
Yeah, I think you're right, src/backend/port seems like a better place
for this. I'll look into moving that in the evening.
On a second look I wonder if just the SQL function and perhaps the page size
function should be moved. There are FE programs that could potentially
benefit from NUMA awareness (e.g. pgbench).
I would move pg_numa_available() to something like
src/backend/storage/ipc/shmem.c.
Makes sense, done in the attached patch.
I wonder if pg_numa_get_pagesize() should lose the _numa_ in the name, it's
not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in
with shmem.c or such.
True. It's true it's not really "NUMA page", but page size for the shmem
segment. So renamed to pg_get_shmem_pagesize() and moved to shmem.c,
same as pg_numa_available().
The rest of pg_numa.c is moved to src/backend/port
Hi Tomas, sorry for this. I've tested the numa-fixes.patch, CI says it
is ok to me (but it was always like that), meson test also looks good
here, and some other quick use attempts are OK. I can confirm that
`ninja -t missingdeps` is clean with this (without it really wasn't;
+1 to adding this check to CI).
Or we could just make it work in FE code by making this part
Assert(IsUnderPostmaster);
Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
if (huge_pages_status == HUGE_PAGES_ON)
GetHugePageSize(&os_page_size, NULL);
#ifndef FRONTEND - we don't currently support using huge pages in FE programs
after all. But querying the page size might still be useful.
I don't really like this. Why shouldn't the FE program simply call
sysconf(_SC_PAGESIZE)? It'd be just confusing if in backend it'd also
verify huge page status.
True, pg_get_shmem_pagesize() looks like a great middle ground (not
NUMA-specific, but still about shm w/ huge pages).
Regardless of all of that, I don't think the include of fmgr.h in pg_numa.h is
needed?
Right, that was left there after moving the prototype into the .c file.
-J.
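To illustrate the frontend side of the discussion above: a frontend program that only needs the regular OS page size can query it directly, without any of the backend's huge page checks. The following is a minimal sketch under that assumption; the function name fe_get_os_pagesize() is hypothetical and not part of any posted patch, and it deliberately ignores huge pages (which pg_get_shmem_pagesize() handles in the backend).

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

/*
 * Hypothetical frontend helper (not part of the patch): return the regular
 * OS page size in bytes, or 0 if it cannot be determined.  Unlike the
 * backend's pg_get_shmem_pagesize(), it knows nothing about huge pages.
 */
size_t
fe_get_os_pagesize(void)
{
#ifdef _WIN32
	SYSTEM_INFO sysinfo;

	GetSystemInfo(&sysinfo);
	return (size_t) sysinfo.dwPageSize;
#else
	long		res = sysconf(_SC_PAGESIZE);

	/* sysconf() returns -1 on failure; map that to 0 for the caller */
	return (res > 0) ? (size_t) res : 0;
#endif
}
```

Keeping this separate from the backend function means the backend-only asserts on IsUnderPostmaster and huge_pages_status never need an #ifndef FRONTEND guard.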
On 4/9/25 01:29, Andres Freund wrote:
Hi,
On 2025-04-09 01:10:09 +0200, Tomas Vondra wrote:
On 4/8/25 15:06, Andres Freund wrote:
Hi,
On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
I'll let the CI run the tests on it, and
then will push, unless someone has more comments.
Hi! I noticed a strange failure after this commit [0].
It looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48:
In file included from ../pgsql/src/include/postgres.h:49,
                 from ../pgsql/src/port/pg_numa.c:16:
../pgsql/src/include/utils/elog.h:79:10: fatal error: utils/errcodes.h: No such file or directory
   79 | #include "utils/errcodes.h"
      |          ^~~~~~~~~~~~~~~~~~
compilation terminated.
$ ninja -t missingdeps
Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Processed 2384 nodes.
Error: There are 2 missing dependency paths.
2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator.
There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory.
Wouldn't it be good to add this (ninja -t missingdeps) to the CI task? I
ran those tests many times, and had it failed at least once I'd have
fixed it before commit.
Yes, we should. It's a somewhat newer feature, so we originally couldn't.
There only was a very clunky and slow python script when I was doing most of
the meson work.
I was actually thinking that it might make sense as a meson-registered test,
that way one quickly can find the issue both locally and on CI.
OK, here are two patches, where 0001 adds the missingdeps check to the
Debian meson build. It just adds that to the build script.
0002 leaves the NUMA stuff in src/port (i.e. it's no longer moved to
src/backend/port). It still needs to include c.h because of PGDLLIMPORT,
but I think that's fine.
regards
--
Tomas Vondra
On 4/9/25 14:07, Tomas Vondra wrote:
...
OK, here are two patches, where 0001 adds the missingdeps check to the
Debian meson build. It just adds that to the build script.
0002 leaves the NUMA stuff in src/port (i.e. it's no longer moved to
src/backend/port). It still needs to include c.h because of PGDLLIMPORT,
but I think that's fine.
Forgot to attach the patches ...
--
Tomas Vondra
Attachments:
0001-adjust-ci.patch
From a5d2080e8eb966698ced88364fc7beebea8a226a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Wed, 9 Apr 2025 13:29:31 +0200
Subject: [PATCH 1/2] adjust ci
---
.cirrus.tasks.yml | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 98f3455eb72..94ded37e29a 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -528,8 +528,17 @@ task:
build-32
EOF
- build_script: su postgres -c 'ninja -C build -j${BUILD_JOBS} ${MBUILD_TARGET}'
- build_32_script: su postgres -c 'ninja -C build-32 -j${BUILD_JOBS} ${MBUILD_TARGET}'
+ build_script: |
+ su postgres <<-EOF
+ ninja -C build -j${BUILD_JOBS} ${MBUILD_TARGET}
+ ninja -C build -t missingdeps
+ EOF
+
+ build_32_script: |
+ su postgres <<-EOF
+ ninja -C build-32 -j${BUILD_JOBS} ${MBUILD_TARGET}
+ ninja -C build -t missingdeps
+ EOF
upload_caches: ccache
--
2.49.0
0002-fixup.patch
From 60b71e953d80150e6b596937a1f8fcd1af510798 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 8 Apr 2025 23:31:29 +0200
Subject: [PATCH 2/2] fixup
---
contrib/pg_buffercache/pg_buffercache_pages.c | 2 +-
src/backend/storage/ipc/shmem.c | 33 +++++++++++++++-
src/include/port/pg_numa.h | 3 --
src/include/storage/shmem.h | 2 +
src/port/pg_numa.c | 38 +------------------
5 files changed, 36 insertions(+), 42 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index c9ceba604b1..e1701bd56ef 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -343,7 +343,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
* This information is needed before calling move_pages() for NUMA
* node id inquiry.
*/
- os_page_size = pg_numa_get_pagesize();
+ os_page_size = pg_get_shmem_pagesize();
/*
* The pages and block size is expected to be 2^k, so one divides the
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index e10b380e5c7..0903eb50f54 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -93,6 +93,8 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
/* To get reliable results for NUMA inquiry we need to "touch pages" once */
static bool firstNumaTouch = true;
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
*/
@@ -615,7 +617,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
* This information is needed before calling move_pages() for NUMA memory
* node inquiry.
*/
- os_page_size = pg_numa_get_pagesize();
+ os_page_size = pg_get_shmem_pagesize();
/*
* Allocate memory for page pointers and status based on total shared
@@ -727,3 +729,32 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/* This should be used only after the server is started */
+Size
+pg_get_shmem_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 7e990d9f776..40f1d324dcf 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -14,12 +14,9 @@
#ifndef PG_NUMA_H
#define PG_NUMA_H
-#include "fmgr.h"
-
extern PGDLLIMPORT int pg_numa_init(void);
extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
extern PGDLLIMPORT int pg_numa_get_max_node(void);
-extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
#ifdef USE_LIBNUMA
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 904a336b851..c1f668ded95 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -41,6 +41,8 @@ extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
extern Size add_size(Size s1, Size s2);
extern Size mul_size(Size s1, Size s2);
+extern PGDLLIMPORT Size pg_get_shmem_pagesize(void);
+
/* ipci.c */
extern void RequestAddinShmemSpace(Size size);
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 5e2523cf798..63dff799436 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -13,17 +13,14 @@
*-------------------------------------------------------------------------
*/
-#include "postgres.h"
+#include "c.h"
#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif
-#include "fmgr.h"
-#include "miscadmin.h"
#include "port/pg_numa.h"
-#include "storage/pg_shmem.h"
/*
* At this point we provide support only for Linux thanks to libnuma, but in
@@ -36,8 +33,6 @@
#include <numa.h>
#include <numaif.h>
-Datum pg_numa_available(PG_FUNCTION_ARGS);
-
/* libnuma requires initialization as per numa(3) on Linux */
int
pg_numa_init(void)
@@ -66,8 +61,6 @@ pg_numa_get_max_node(void)
#else
-Datum pg_numa_available(PG_FUNCTION_ARGS);
-
/* Empty wrappers */
int
pg_numa_init(void)
@@ -89,32 +82,3 @@ pg_numa_get_max_node(void)
}
#endif
-
-Datum
-pg_numa_available(PG_FUNCTION_ARGS)
-{
- PG_RETURN_BOOL(pg_numa_init() != -1);
-}
-
-/* This should be used only after the server is started */
-Size
-pg_numa_get_pagesize(void)
-{
- Size os_page_size;
-#ifdef WIN32
- SYSTEM_INFO sysinfo;
-
- GetSystemInfo(&sysinfo);
- os_page_size = sysinfo.dwPageSize;
-#else
- os_page_size = sysconf(_SC_PAGESIZE);
-#endif
-
- Assert(IsUnderPostmaster);
- Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
-
- if (huge_pages_status == HUGE_PAGES_ON)
- GetHugePageSize(&os_page_size, NULL);
-
- return os_page_size;
-}
--
2.49.0
Updated patches with proper commit messages etc.
--
Tomas Vondra
Attachments:
0001-Cleanup-of-pg_numa.c.patch
From e1f093d091610d70fba72b2848f25ff44899ea8e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 8 Apr 2025 23:31:29 +0200
Subject: [PATCH 1/2] Cleanup of pg_numa.c
This moves/renames some of the functions defined in pg_numa.c:
* pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and
moved to src/backend/storage/ipc/shmem.c. The new name better reflects
that the page size is not related to NUMA, and it's specifically about
the page size used for the main shared memory segment.
* move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into
the backend (which is more appropriate for functions callable from SQL).
While at it, improve the comment to explain what page size it returns.
* remove unnecessary includes from src/port/pg_numa.c, adding
unnecessary dependencies (src/port should be suitable for frontend).
These were leftovers from earlier patch versions.
This eliminates unnecessary dependencies on backend symbols, which we
don't want in src/port.
Reported-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
---
contrib/pg_buffercache/pg_buffercache_pages.c | 2 +-
src/backend/storage/ipc/shmem.c | 40 ++++++++++++++++++-
src/include/port/pg_numa.h | 3 --
src/include/storage/shmem.h | 2 +
src/port/pg_numa.c | 38 +-----------------
5 files changed, 43 insertions(+), 42 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index c9ceba604b1..e1701bd56ef 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -343,7 +343,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
* This information is needed before calling move_pages() for NUMA
* node id inquiry.
*/
- os_page_size = pg_numa_get_pagesize();
+ os_page_size = pg_get_shmem_pagesize();
/*
* The pages and block size is expected to be 2^k, so one divides the
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index e10b380e5c7..c9ae3b45b76 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -93,6 +93,8 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
/* To get reliable results for NUMA inquiry we need to "touch pages" once */
static bool firstNumaTouch = true;
+Datum pg_numa_available(PG_FUNCTION_ARGS);
+
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
*/
@@ -615,7 +617,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
* This information is needed before calling move_pages() for NUMA memory
* node inquiry.
*/
- os_page_size = pg_numa_get_pagesize();
+ os_page_size = pg_get_shmem_pagesize();
/*
* Allocate memory for page pointers and status based on total shared
@@ -727,3 +729,39 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+
+/*
+ * Determine the memory page size used for the shared memory segment.
+ *
+ * If the shared segment was allocated using huge pages, returns the size of
+ * a huge page. Otherwise returns the size of regular memory page.
+ *
+ * This should be used only after the server is started.
+ */
+Size
+pg_get_shmem_pagesize(void)
+{
+ Size os_page_size;
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ Assert(IsUnderPostmaster);
+ Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+ if (huge_pages_status == HUGE_PAGES_ON)
+ GetHugePageSize(&os_page_size, NULL);
+
+ return os_page_size;
+}
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_BOOL(pg_numa_init() != -1);
+}
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 7e990d9f776..40f1d324dcf 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -14,12 +14,9 @@
#ifndef PG_NUMA_H
#define PG_NUMA_H
-#include "fmgr.h"
-
extern PGDLLIMPORT int pg_numa_init(void);
extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
extern PGDLLIMPORT int pg_numa_get_max_node(void);
-extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
#ifdef USE_LIBNUMA
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 904a336b851..c1f668ded95 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -41,6 +41,8 @@ extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
extern Size add_size(Size s1, Size s2);
extern Size mul_size(Size s1, Size s2);
+extern PGDLLIMPORT Size pg_get_shmem_pagesize(void);
+
/* ipci.c */
extern void RequestAddinShmemSpace(Size size);
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 5e2523cf798..63dff799436 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -13,17 +13,14 @@
*-------------------------------------------------------------------------
*/
-#include "postgres.h"
+#include "c.h"
#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif
-#include "fmgr.h"
-#include "miscadmin.h"
#include "port/pg_numa.h"
-#include "storage/pg_shmem.h"
/*
* At this point we provide support only for Linux thanks to libnuma, but in
@@ -36,8 +33,6 @@
#include <numa.h>
#include <numaif.h>
-Datum pg_numa_available(PG_FUNCTION_ARGS);
-
/* libnuma requires initialization as per numa(3) on Linux */
int
pg_numa_init(void)
@@ -66,8 +61,6 @@ pg_numa_get_max_node(void)
#else
-Datum pg_numa_available(PG_FUNCTION_ARGS);
-
/* Empty wrappers */
int
pg_numa_init(void)
@@ -89,32 +82,3 @@ pg_numa_get_max_node(void)
}
#endif
-
-Datum
-pg_numa_available(PG_FUNCTION_ARGS)
-{
- PG_RETURN_BOOL(pg_numa_init() != -1);
-}
-
-/* This should be used only after the server is started */
-Size
-pg_numa_get_pagesize(void)
-{
- Size os_page_size;
-#ifdef WIN32
- SYSTEM_INFO sysinfo;
-
- GetSystemInfo(&sysinfo);
- os_page_size = sysinfo.dwPageSize;
-#else
- os_page_size = sysconf(_SC_PAGESIZE);
-#endif
-
- Assert(IsUnderPostmaster);
- Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
-
- if (huge_pages_status == HUGE_PAGES_ON)
- GetHugePageSize(&os_page_size, NULL);
-
- return os_page_size;
-}
--
2.49.0
0002-ci-Check-for-missing-dependencies-in-meson-build.patch
From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Wed, 9 Apr 2025 13:29:31 +0200
Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build
Extends the meson build on Debian to also check for missing dependencies
by executing
ninja -t missingdeps
right after the build. This highlights unintended dependencies.
Reviewed-by: Andres Freund <andres@anarazel.de>
https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
---
.cirrus.tasks.yml | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 98f3455eb72..94ded37e29a 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -528,8 +528,17 @@ task:
build-32
EOF
- build_script: su postgres -c 'ninja -C build -j${BUILD_JOBS} ${MBUILD_TARGET}'
- build_32_script: su postgres -c 'ninja -C build-32 -j${BUILD_JOBS} ${MBUILD_TARGET}'
+ build_script: |
+ su postgres <<-EOF
+ ninja -C build -j${BUILD_JOBS} ${MBUILD_TARGET}
+ ninja -C build -t missingdeps
+ EOF
+
+ build_32_script: |
+ su postgres <<-EOF
+ ninja -C build-32 -j${BUILD_JOBS} ${MBUILD_TARGET}
+ ninja -C build -t missingdeps
+ EOF
upload_caches: ccache
--
2.49.0
Hi,
On 2025-04-09 16:33:14 +0200, Tomas Vondra wrote:
From e1f093d091610d70fba72b2848f25ff44899ea8e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 8 Apr 2025 23:31:29 +0200
Subject: [PATCH 1/2] Cleanup of pg_numa.c
This moves/renames some of the functions defined in pg_numa.c:
* pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and
moved to src/backend/storage/ipc/shmem.c. The new name better reflects
that the page size is not related to NUMA, and it's specifically about
the page size used for the main shared memory segment.
* move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into
the backend (which is more appropriate for functions callable from SQL).
While at it, improve the comment to explain what page size it returns.
* remove unnecessary includes from src/port/pg_numa.c, adding
unnecessary dependencies (src/port should be suitable for frontend).
These were leftovers from earlier patch versions.
I don't think the includes in src/port/pg_numa.c were leftovers? Just the one in
pg_numa.h, right?
I'd mention that the includes of postgres.h/fmgr.h are what caused missing
build-time dependencies and via that failures on buildfarm member dogfish.
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 5e2523cf798..63dff799436 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -13,17 +13,14 @@
*-------------------------------------------------------------------------
*/
-#include "postgres.h"
+#include "c.h"
#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif
I think this may not be needed anymore, that was just there for
GetSystemInfo(), right? Conversely, I suspect it may now be needed in the new
location of pg_numa_get_pagesize()?
From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Wed, 9 Apr 2025 13:29:31 +0200
Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build
Extends the meson build on Debian to also check for missing dependencies
by executing
ninja -t missingdeps
right after the build. This highlights unintended dependencies.
Reviewed-by: Andres Freund <andres@anarazel.de>
/messages/by-id/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to
have it just in CI until we have that. I would however also add it to the
windows job, as that's the most "different" type of build / source of missed
dependencies that wouldn't show up on our development systems.
Greetings,
Andres Freund
On 4/9/25 17:14, Andres Freund wrote:
Hi,
On 2025-04-09 16:33:14 +0200, Tomas Vondra wrote:
From e1f093d091610d70fba72b2848f25ff44899ea8e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 8 Apr 2025 23:31:29 +0200
Subject: [PATCH 1/2] Cleanup of pg_numa.c
This moves/renames some of the functions defined in pg_numa.c:
* pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and
moved to src/backend/storage/ipc/shmem.c. The new name better reflects
that the page size is not related to NUMA, and it's specifically about
the page size used for the main shared memory segment.
* move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into
the backend (which is more appropriate for functions callable from SQL).
While at it, improve the comment to explain what page size it returns.
* remove unnecessary includes from src/port/pg_numa.c, adding
unnecessary dependencies (src/port should be suitable for frontend).
These were leftovers from earlier patch versions.
I don't think the includes in src/port/pg_numa.c were leftovers? Just the one in
pg_numa.h, right?
Right, that wasn't quite accurate. The miscadmin.h and pg_shmem.h are
unnecessary thanks to moving stuff to shmem.c.
I'd mention that the includes of postgres.h/fmgr.h are what caused missing
build-time dependencies and via that failures on buildfarm member dogfish.
Not really, I also need to include "c.h" instead of "postgres.h" (which
is also causing the same failure).
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 5e2523cf798..63dff799436 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -13,17 +13,14 @@
*-------------------------------------------------------------------------
*/
-#include "postgres.h"
+#include "c.h"
#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif
I think this may not be needed anymore, that was just there for
GetSystemInfo(), right? Conversely, I suspect it may now be needed in the new
location of pg_numa_get_pagesize()?
Good question. But if it's needed there, shouldn't it have failed on CI?
From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Wed, 9 Apr 2025 13:29:31 +0200
Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build
Extends the meson build on Debian to also check for missing dependencies
by executing
ninja -t missingdeps
right after the build. This highlights unintended dependencies.
Reviewed-by: Andres Freund <andres@anarazel.de>
/messages/by-id/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to
have it just in CI until we have that. I would however also add it to the
windows job, as that's the most "different" type of build / source of missed
dependencies that wouldn't show up on our development systems.
We can add it as a meson.build test, sure. I was going for the CI first,
because then it fires no matter what build I do locally (I'm kinda still
used to autotools).
If you agree adding it to build_script is the right way to do that, I'll
do the same thing for the windows job.
regards
--
Tomas Vondra
Hi,
On 2025-04-09 17:28:31 +0200, Tomas Vondra wrote:
On 4/9/25 17:14, Andres Freund wrote:
I'd mention that the includes of postgres.h/fmgr.h are what caused missing
build-time dependencies and via that failures on buildfarm member dogfish.
Not really, I also need to include "c.h" instead of "postgres.h" (which
is also causing the same failure).
I did mention postgres.h :)
I think this may not be needed anymore, that was just there for
GetSystemInfo(), right? Conversely, I suspect it may now be needed in the new
location of pg_numa_get_pagesize()?
Good question. But if it's needed there, shouldn't it have failed on CI?
Oh. No. It shouldn't have - because that include was completely
unnecessary. We always include windows.h on windows.
From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Wed, 9 Apr 2025 13:29:31 +0200
Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build
Extends the meson build on Debian to also check for missing dependencies
by executing
ninja -t missingdeps
right after the build. This highlights unintended dependencies.
Reviewed-by: Andres Freund <andres@anarazel.de>
/messages/by-id/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to
have it just in CI until we have that. I would however also add it to the
windows job, as that's the most "different" type of build / source of missed
dependencies that wouldn't show up on our development systems.
We can add it as a meson.build test, sure. I was going for the CI first,
because then it fires no matter what build I do locally (I'm kinda still
used to autotools).
A meson test would do the same thing, it'd fail while running the tests, no?
If you agree adding it to build_script is the right way to do that, I'll
do the same thing for the windows job.
WFM.
Greetings,
Andres Freund
On 4/9/25 17:51, Andres Freund wrote:
Hi,
On 2025-04-09 17:28:31 +0200, Tomas Vondra wrote:
On 4/9/25 17:14, Andres Freund wrote:
I'd mention that the includes of postgres.h/fmgr.h are what caused missing
build-time dependencies and via that failures on buildfarm member dogfish.
Not really, I also need to include "c.h" instead of "postgres.h" (which
is also causing the same failure).
I did mention postgres.h :)
D'oh, I missed that. I was focused on the fmgr one.
I think this may not be needed anymore, that was just there for
GetSystemInfo(), right? Conversely, I suspect it may now be needed in the new
location of pg_numa_get_pagesize()?
Good question. But if it's needed there, shouldn't it have failed on CI?
Oh. No. It shouldn't have - because that include was completely
unnecessary. We always include windows.h on windows.
Makes sense. I'll get rid of the windows.h include.
From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Wed, 9 Apr 2025 13:29:31 +0200
Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build
Extends the meson build on Debian to also check for missing dependencies
by executing
ninja -t missingdeps
right after the build. This highlights unintended dependencies.
Reviewed-by: Andres Freund <andres@anarazel.de>
/messages/by-id/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to
have it just in CI until we have that. I would however also add it to the
windows job, as that's the most "different" type of build / source of missed
dependencies that wouldn't show up on our development systems.
We can add it as a meson.build test, sure. I was going for the CI first,
because then it fires no matter what build I do locally (I'm kinda still
used to autotools).
A meson test would do the same thing, it'd fail while running the tests, no?
Sure, but only if you use meson. Which I still mostly don't, so I've
been thinking about the CI first, because I use that very consistently
before pushing something.
If you agree adding it to build_script is the right way to do that, I'll
do the same thing for the windows job.
WFM.
Thanks. I'll polish this a bit more and push.
regards
--
Tomas Vondra
Hi,
On Tue, Apr 08, 2025 at 12:46:16PM +0200, Tomas Vondra wrote:
On 4/8/25 01:26, Shinoda, Noriyoshi (SXD Japan FSI) wrote:
Hi,
Thanks for developing this great feature.
The manual says that the 'size' column of the pg_shmem_allocations_numa view is 'int4', but the implementation is 'int8'.
The attached small patch fixes the manual.
Thank you for noticing this and for the fix! Pushed.
This also reminded me we agreed to change page_num to bigint, which I
forgot to change before commit. So I adjusted that too, separately.
I was doing some extra testing and just realized (did not think of it during the
review) that maybe we could add a pg_buffercache_numa usage example (like it's
already done for pg_buffercache).
That might sound obvious but OTOH I think that would not hurt.
Something like in the attached?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Add-pg_buffercache_numa-usage-example.patch
From c22533a6c8a8526b75a95ccf0eb39b5d10cca1f2 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Wed, 9 Apr 2025 18:11:28 +0000
Subject: [PATCH v1] Add pg_buffercache_numa usage example
Add a query showing pg_buffercache_numa usage.
---
doc/src/sgml/pgbuffercache.sgml | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 537d6014942..86846830e48 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -550,6 +550,30 @@ regression=# SELECT n.nspname, c.relname, count(*) AS buffers
public | spgist_text_tbl | 182
(10 rows)
+regression=# SELECT n.nspname, c.relname, m.numa_node, count(*) AS buffers
+ FROM pg_buffercache b JOIN pg_class c
+ ON b.relfilenode = pg_relation_filenode(c.oid) AND
+ b.reldatabase IN (0, (SELECT oid FROM pg_database
+ WHERE datname = current_database()))
+ JOIN pg_namespace n ON n.oid = c.relnamespace
+ JOIN pg_buffercache_numa m ON b.bufferid = m.bufferid
+ GROUP BY n.nspname, c.relname, m.numa_node
+ ORDER BY 4 DESC
+ LIMIT 10;
+
+ nspname | relname | numa_node | buffers
+------------+------------------------+-----------+---------
+ public | delete_test_table_pkey | 1 | 395
+ public | delete_test_table | 0 | 367
+ pg_catalog | pg_attribute | 0 | 306
+ pg_catalog | pg_attribute | 1 | 273
+ public | delete_test_table | 1 | 228
+ pg_catalog | pg_largeobject | 0 | 193
+ public | tenk1 | 1 | 188
+ public | gin_test_idx | 1 | 187
+ public | quad_poly_tbl | 1 | 182
+ public | tenk2 | 0 | 176
+(10 rows)
regression=# SELECT * FROM pg_buffercache_summary();
buffers_used | buffers_unused | buffers_dirty | buffers_pinned | usagecount_avg
--
2.34.1
Hi!
On 4/7/25 11:27 PM, Tomas Vondra wrote:
I've pushed all three parts of v29, with some additional corrections
(picked lower OIDs, bumped catversion, fixed commit messages).
While building the PG18 beta1/2 packages I noticed that in our build
containers the selftest for pg_buffercache_numa and numa failed. It
seems that although libnuma was available and pg_numa_init/numa_available
returned no errors, we still fail in pg_numa_query_pages/move_pages with
EPERM, yielding the following error when accessing
pg_buffercache_numa/pg_shmem_allocations_numa:
ERROR: failed NUMA pages inquiry: Operation not permitted
The man page of move_pages led me to believe that this is because of
the missing capability CAP_SYS_NICE on the process, but I couldn't prove
that theory with the attached patch.
The patch did make the tests pass but also disabled NUMA permanently on
a vanilla Debian VM and that is certainly not wanted. It may well be
that my understanding of checking capabilities and how they work is
incomplete. I also think that adding a new dependency for the reason of
just checking the capability is probably a bit of an overkill, maybe we
can check if we can access move_pages once without an error before
treating it as one?
I'd be happy to debug this further but I have limited access to our
build-infra, I should be able to sneak in commands during the build though.
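A minimal sketch of that "try it once" idea, assuming Linux and going through the raw move_pages(2) syscall so that it does not depend on libnuma or libcap. The names NumaProbeResult and probe_numa_query() are hypothetical and not part of the attached patch or of PostgreSQL:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical result type for this sketch, not part of PostgreSQL. */
typedef enum
{
	NUMA_PROBE_OK,			/* the query-only call succeeded */
	NUMA_PROBE_DENIED,		/* EPERM/EACCES: syscall exists but is refused */
	NUMA_PROBE_UNAVAILABLE	/* anything else, e.g. ENOSYS or no syscall number */
} NumaProbeResult;

/*
 * Ask the kernel once which node a single page of our own process lives on.
 * With nodes == NULL, move_pages(2) moves nothing and only fills status[].
 */
NumaProbeResult
probe_numa_query(void)
{
#ifdef SYS_move_pages
	long		pagesz = sysconf(_SC_PAGESIZE);
	void	   *page = NULL;
	int			status = 0;
	long		rc;
	int			saved_errno;

	assert(pagesz > 0);
	if (posix_memalign(&page, (size_t) pagesz, (size_t) pagesz) != 0)
		return NUMA_PROBE_UNAVAILABLE;

	/* Touch the page so that it is actually faulted in. */
	*(volatile char *) page = 0;

	/* pid 0 means "the calling process"; flags are 0 for a pure query. */
	rc = syscall(SYS_move_pages, 0L, 1UL, &page, NULL, &status, 0L);
	saved_errno = errno;
	free(page);

	if (rc == 0)
		return NUMA_PROBE_OK;
	if (saved_errno == EPERM || saved_errno == EACCES)
		return NUMA_PROBE_DENIED;
	return NUMA_PROBE_UNAVAILABLE;
#else
	return NUMA_PROBE_UNAVAILABLE;
#endif
}
```

Under these assumptions, an init routine could report NUMA as unavailable whenever the probe does not return NUMA_PROBE_OK, instead of inspecting capabilities directly. Whether that is the right place for such a check is open for discussion.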
Thanks,
Patrick
Attachments:
0001-Check-for-CAP_SYS_NICE-in-pg_numa_init-v1.patch (text/x-patch; charset=UTF-8)
From 5ddfa2184b85d76e95cbe2bc00991cad6b154fa0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Patrick=20St=C3=A4hlin?= <me@packi.ch>
Date: Thu, 17 Jul 2025 19:04:25 +0200
Subject: [PATCH] Check for CAP_SYS_NICE in pg_numa_init
Make sure we have CAP_SYS_NICE set for our process to see if we're
actually able to use any of the libnuma functions. If this is not set
the selftest will fail in some build environments where we compile with
libnuma but are then not able to use it.
Failing test is pg_buffercache_numa with the following message:
ERROR: failed NUMA pages inquiry: Operation not permitted
---
configure | 187 +++++++++++++++++++++++++++++++++
configure.ac | 16 +++
doc/src/sgml/installation.sgml | 20 ++++
meson.build | 13 +++
meson_options.txt | 3 +
src/include/pg_config.h.in | 6 ++
src/port/pg_numa.c | 35 +++++-
7 files changed, 279 insertions(+), 1 deletion(-)
diff --git a/configure b/configure
index 1b9980226c5..0126a4991a4 100755
--- a/configure
+++ b/configure
@@ -717,6 +717,9 @@ LIBCURL_CPPFLAGS
LIBCURL_LIBS
LIBCURL_CFLAGS
with_libcurl
+LIBCAP_LIBS
+LIBCAP_CFLAGS
+with_libcap
with_uuid
LIBURING_LIBS
LIBURING_CFLAGS
@@ -877,6 +880,7 @@ with_libedit_preferred
with_liburing
with_uuid
with_ossp_uuid
+with_libcap
with_libcurl
with_libnuma
with_libxml
@@ -911,6 +915,8 @@ ICU_CFLAGS
ICU_LIBS
LIBURING_CFLAGS
LIBURING_LIBS
+LIBCAP_CFLAGS
+LIBCAP_LIBS
LIBCURL_CFLAGS
LIBCURL_LIBS
LIBNUMA_CFLAGS
@@ -1596,6 +1602,7 @@ Optional Packages:
--with-liburing build with io_uring support, for asynchronous I/O
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
+ --with-libcap build with libcap support
--with-libcurl build with libcurl support
--with-libnuma build with libnuma support
--with-libxml build with XML support
@@ -1635,6 +1642,9 @@ Some influential environment variables:
C compiler flags for LIBURING, overriding pkg-config
LIBURING_LIBS
linker flags for LIBURING, overriding pkg-config
+ LIBCAP_CFLAGS
+ C compiler flags for LIBCAP, overriding pkg-config
+ LIBCAP_LIBS linker flags for LIBCAP, overriding pkg-config
LIBCURL_CFLAGS
C compiler flags for LIBCURL, overriding pkg-config
LIBCURL_LIBS
@@ -8926,6 +8936,183 @@ fi
+#
+# libcap
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libcap support" >&5
+$as_echo_n "checking whether to build with libcap support... " >&6; }
+
+
+
+# Check whether --with-libcap was given.
+if test "${with_libcap+set}" = set; then :
+ withval=$with_libcap;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBCAP 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libcap option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libcap=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libcap" >&5
+$as_echo "$with_libcap" >&6; }
+
+
+if test "$with_libcap" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for cap_get_proc in -lcap" >&5
+$as_echo_n "checking for cap_get_proc in -lcap... " >&6; }
+if ${ac_cv_lib_cap_cap_get_proc+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lcap $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char cap_get_proc ();
+int
+main ()
+{
+return cap_get_proc ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_cap_cap_get_proc=yes
+else
+ ac_cv_lib_cap_cap_get_proc=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_cap_cap_get_proc" >&5
+$as_echo "$ac_cv_lib_cap_cap_get_proc" >&6; }
+if test "x$ac_cv_lib_cap_cap_get_proc" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBCAP 1
+_ACEOF
+
+ LIBS="-lcap $LIBS"
+
+else
+ as_fn_error $? "library 'libcap' is required for capabilities support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for libcap" >&5
+$as_echo_n "checking for libcap... " >&6; }
+
+if test -n "$LIBCAP_CFLAGS"; then
+ pkg_cv_LIBCAP_CFLAGS="$LIBCAP_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"libcap\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "libcap") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBCAP_CFLAGS=`$PKG_CONFIG --cflags "libcap" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBCAP_LIBS"; then
+ pkg_cv_LIBCAP_LIBS="$LIBCAP_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"libcap\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "libcap") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBCAP_LIBS=`$PKG_CONFIG --libs "libcap" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBCAP_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "libcap" 2>&1`
+ else
+ LIBCAP_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "libcap" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBCAP_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (libcap) were not met:
+
+$LIBCAP_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBCAP_CFLAGS
+and LIBCAP_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBCAP_CFLAGS
+and LIBCAP_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBCAP_CFLAGS=$pkg_cv_LIBCAP_CFLAGS
+ LIBCAP_LIBS=$pkg_cv_LIBCAP_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
+
#
# libcurl
#
diff --git a/configure.ac b/configure.ac
index 3e3fcfa9831..3c595212b59 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1019,6 +1019,21 @@ fi
AC_SUBST(with_uuid)
+#
+# libcap
+#
+AC_MSG_CHECKING([whether to build with libcap support])
+PGAC_ARG_BOOL(with, libcap, no, [build with libcap support],
+ [AC_DEFINE([USE_LIBCAP], 1, [Define to build with libcap support. (--with-libcap)])])
+AC_MSG_RESULT([$with_libcap])
+AC_SUBST(with_libcap)
+
+if test "$with_libcap" = yes ; then
+ AC_CHECK_LIB(cap, cap_get_proc, [], [AC_MSG_ERROR([library 'libcap' is required for capabilities support])])
+ PKG_CHECK_MODULES(LIBCAP, libcap)
+fi
+
+
#
# libcurl
#
@@ -1075,6 +1090,7 @@ if test "$with_libnuma" = yes ; then
PKG_CHECK_MODULES(LIBNUMA, numa)
fi
+
#
# XML
#
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index de19f3ad929..3696ac64eed 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1151,6 +1151,16 @@ build-postgresql:
</listitem>
</varlistentry>
+ <varlistentry id="configure-option-with-libcap">
+ <term><option>--with-libcap</option></term>
+ <listitem>
+ <para>
+ Build with libcap support for detecting that NUMA can work on the machine.
+ This is only used if the build also uses <literal>--with-libnuma</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-option-with-libcurl">
<term><option>--with-libcurl</option></term>
<listitem>
@@ -2664,6 +2674,16 @@ ninja install
</listitem>
</varlistentry>
+ <varlistentry id="configure-with-libcap-meson">
+ <term><option>-Dlibcap={ auto | enabled | disabled }</option></term>
+ <listitem>
+ <para>
+ Build with libcap support for detecting that NUMA can work on the machine.
+ This is only used if the build also uses <productname>libnuma</productname>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="configure-with-libnuma-meson">
<term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
<listitem>
diff --git a/meson.build b/meson.build
index 21c31f05f75..01676ba5958 100644
--- a/meson.build
+++ b/meson.build
@@ -960,6 +960,17 @@ else
endif
+###############################################################
+# Library: libcap
+###############################################################
+
+libcapopt = get_option('libcap')
+libcap = dependency('libcap', required: libcapopt)
+if libcap.found()
+ cdata.set('USE_LIBCAP', 1)
+endif
+
+
###############################################################
# Library: libnuma
###############################################################
@@ -3328,6 +3339,7 @@ backend_both_deps += [
ldap,
libintl,
libnuma,
+ libcap,
liburing,
libxml,
lz4,
@@ -3983,6 +3995,7 @@ if meson.version().version_compare('>=0.57')
'gss': gssapi,
'icu': icu,
'ldap': ldap,
+ 'libcap': libcap,
'libcurl': libcurl,
'libnuma': libnuma,
'liburing': liburing,
diff --git a/meson_options.txt b/meson_options.txt
index 06bf5627d3c..fef0f58f82c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -100,6 +100,9 @@ option('icu', type: 'feature', value: 'auto',
option('ldap', type: 'feature', value: 'auto',
description: 'LDAP support')
+option('libcap', type: 'feature', value: 'auto',
+ description: 'libcap support')
+
option('libcurl', type : 'feature', value: 'auto',
description: 'libcurl support')
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c4dc5d72bdb..53338b9f8f5 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -244,6 +244,9 @@
/* Define to 1 if you have the `crypto' library (-lcrypto). */
#undef HAVE_LIBCRYPTO
+/* Define to 1 if you have the `cap' library (-lcap). */
+#undef HAVE_LIBCAP
+
/* Define to 1 if you have the `curl' library (-lcurl). */
#undef HAVE_LIBCURL
@@ -693,6 +696,9 @@
/* Define to 1 to build with LDAP support. (--with-ldap) */
#undef USE_LDAP
+/* Define to 1 to build with libcap support. (--with-libcap) */
+#undef USE_LIBCAP
+
/* Define to 1 to build with libcurl support. (--with-libcurl) */
#undef USE_LIBCURL
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 3368a43a338..87c076d0cdf 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -30,6 +30,10 @@
#include <numa.h>
#include <numaif.h>
+#ifdef USE_LIBCAP
+#include <sys/capability.h>
+#endif
+
/*
* numa_move_pages() chunk size, has to be <= 16 to work around a kernel bug
* in do_pages_stat() (chunked by DO_PAGES_STAT_CHUNK_NR). By using the same
@@ -47,9 +51,38 @@
int
pg_numa_init(void)
{
- int r = numa_available();
+ int r;
+#ifdef USE_LIBCAP
+ cap_t cap;
+ cap_flag_value_t on;
+#endif
+
+ r = numa_available();
+
+#ifdef USE_LIBCAP
+ if (r == -1)
+ return r;
+
+ /*
+ * Check if we have CAP_SYS_NICE set, which is required by NUMA function
+ * calls.
+ */
+ cap = cap_get_proc();
+ if (cap == NULL)
+ return r;
+
+ if (cap_get_flag(cap, CAP_SYS_NICE, CAP_PERMITTED, &on) != 0)
+ on = CAP_CLEAR;
+
+ cap_free(cap);
+
+ if (on == CAP_SET)
+ return r;
+ return -1;
+#else
return r;
+#endif
}
/*
--
2.48.1
On Tue, Jul 22, 2025 at 11:30 AM Patrick Stählin <me@packi.ch> wrote:
Hi!
On 4/7/25 11:27 PM, Tomas Vondra wrote:
I've pushed all three parts of v29, with some additional corrections
(picked lower OIDs, bumped catversion, fixed commit messages).
While building the PG18 beta1/2 packages I noticed that in our build
containers the selftest for pg_buffercache_numa and numa failed. It
seems that libnuma was available and pg_numa_init/numa_available returns
no errors, we still fail in pg_numa_query_pages/move_pages with EPERM
yielding the following error when accessing
pg_buffercache_numa/pg_shmem_allocations_numa:
ERROR: failed NUMA pages inquiry: Operation not permitted
The man-page of move_pages led me to believe that this is because of
the missing capability CAP_SYS_NICE on the process but I couldn't prove
that theory with the attached patch.
The patch did make the tests pass but also disabled NUMA permanently on
a vanilla Debian VM and that is certainly not wanted. It may well be
that my understanding of checking capabilities and how they work is
incomplete. I also think that adding a new dependency for the reason of
just checking the capability is probably a bit of an overkill, maybe we
can check if we can access move_pages once without an error before
treating it as one?
I'd be happy to debug this further but I have limited access to our
build-infra, I should be able to sneak in commands during the build though.
Hi Patrick,
So is it because the container was started without CAP_SYS_NICE, so
that even after root -> postgres the process does not have this cap? In
my book a container would be rather small, and a single container
certainly wouldn't span multiple CPU sockets, so I would just disable
libnuma there. Anyway, if I do this on a regular VM:
# capsh --drop=CAP_SYS_NICE -- -c "su - postgres"
$ /usr/sbin/capsh --print
[..]
Current IAB: !cap_sys_nice
[..]
then I can still query pg_shmem_allocations_numa and
pg_buffercache_numa after startup. The same happens with setpriv(1), if
I do a little cross-check:
# setpriv --reuid nobody --regid nogroup --clear-groups --bounding=-sys_nice -- id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
# setpriv --reuid nobody --regid nogroup --clear-groups --bounding=-sys_nice -- sleep 60 &
# pgrep sleep ### => 14882
# grep ^Cap /proc/14882/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffff7fffff
CapAmb: 0000000000000000
# capsh --decode=000001ffff7fffff
0x000001ffff7fffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
# capsh --decode=000001ffff7fffff | grep -i nice ### nothing (no cap)
#
and then starting PG for real:
# setpriv --reuid postgres --regid postgres --clear-groups --bounding=-sys_nice -- /usr/pgsql19/bin/pg_ctl -D /tmp/pg19 -l /tmp/logfile start
$ psql.. ### => pid 15012
# grep ^Cap /proc/15012/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffff7fffff
CapAmb: 0000000000000000
# capsh --decode=000001ffff7fffff
0x000001ffff7fffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
# capsh --decode=000001ffff7fffff | grep -i nice ### nothing (no cap)
#
... and I still cannot reproduce this in a VM.
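The manual decoding of the CapBnd mask above can also be done programmatically. A small sketch, assuming the /proc/PID/status format shown in the transcript; cap_mask_has() and read_cap_field() are hypothetical helper names, and EXAMPLE_CAP_SYS_NICE just repeats the value 23 that <linux/capability.h> assigns to CAP_SYS_NICE:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* CAP_SYS_NICE is capability number 23 in <linux/capability.h>. */
#define EXAMPLE_CAP_SYS_NICE 23

/* Is capability number "cap" set in a mask as printed in /proc/PID/status? */
bool
cap_mask_has(uint64_t mask, int cap)
{
	assert(cap >= 0 && cap < 64);
	return ((mask >> cap) & UINT64_C(1)) != 0;
}

/*
 * Read one Cap* field (e.g. "CapBnd" or "CapEff") from a status file such
 * as /proc/self/status. Returns true and stores the mask on success.
 */
bool
read_cap_field(const char *path, const char *field, uint64_t *mask)
{
	FILE	   *f = fopen(path, "r");
	char		line[256];
	size_t		len = strlen(field);
	bool		found = false;

	if (f == NULL)
		return false;

	while (!found && fgets(line, sizeof(line), f) != NULL)
	{
		/* Lines look like "CapBnd:\t000001ffff7fffff". */
		if (strncmp(line, field, len) == 0 && line[len] == ':')
			found = (sscanf(line + len + 1, "%" SCNx64, mask) == 1);
	}
	fclose(f);
	return found;
}
```

With the CapBnd value from the transcript, cap_mask_has(0x000001ffff7fffff, EXAMPLE_CAP_SYS_NICE) is false because bit 23 is clear, which matches the empty grep for "nice". Calling read_cap_field("/proc/self/status", "CapBnd", &mask) would fetch the same field for the current process.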
Can you provide exact details about this container technology?
Can you provide /usr/sbin/capsh --print just before starting PG there?
Maybe this is somehow cgroup/cpuset related too?
Anyway, there is a simpler way to make the tests pass if that's what
you are after. We do have
contrib/pg_buffercache/sql/pg_buffercache_numa.sql which is expected
to match outputs in pg_buffercache_numa.out OR (!)
pg_buffercache_numa_1.out. We could probably handle this edge case by
also adding pg_buffercache_numa_2.out (which would just contain the
semi-valid scenario of "ERROR: failed NUMA pages inquiry:
Operation not permitted").
-J.
Hi Jakub
On 7/24/25 10:01 AM, Jakub Wartak wrote:
On Tue, Jul 22, 2025 at 11:30 AM Patrick Stählin <me@packi.ch> wrote:
Hi!
On 4/7/25 11:27 PM, Tomas Vondra wrote:
I've pushed all three parts of v29, with some additional corrections
(picked lower OIDs, bumped catversion, fixed commit messages).
While building the PG18 beta1/2 packages I noticed that in our build
containers the selftest for pg_buffercache_numa and numa failed. It
seems that libnuma was available and pg_numa_init/numa_available returns
no errors, we still fail in pg_numa_query_pages/move_pages with EPERM
yielding the following error when accessing
pg_buffercache_numa/pg_shmem_allocations_numa:
ERROR: failed NUMA pages inquiry: Operation not permitted
The man-page of move_pages led me to believe that this is because of
the missing capability CAP_SYS_NICE on the process but I couldn't prove
that theory with the attached patch.
The patch did make the tests pass but also disabled NUMA permanently on
a vanilla Debian VM and that is certainly not wanted. It may well be
that my understanding of checking capabilities and how they work is
incomplete. I also think that adding a new dependency for the reason of
just checking the capability is probably a bit of an overkill, maybe we
can check if we can access move_pages once without an error before
treating it as one?
I'd be happy to debug this further but I have limited access to our
build-infra, I should be able to sneak in commands during the build though.
Hi Patrick,
So is it because the container was started without CAP_SYS_NICE, so
that even after root -> postgres the process does not have this cap? In
my book a container would be rather small, and a single container
certainly wouldn't span multiple CPU sockets, so I would just disable
libnuma there. Anyway, if I do this on a regular VM:
[...]
This is just the build environment, but it runs the selftest, which then
fails. The containers we run in prod are a totally different setup, and
there the NUMA calls actually work. Disabling it may be an option, but it
would be nice to detect at runtime that we can't access it.
Can you provide exact details about this container technology?
We use podman to set everything up.
Can you provide /usr/sbin/capsh --print just before starting PG there?
Maybe this is somehow cgroup/cpuset related too?
Here is the output, it seems that cap_sys_nice is missing from the
bounding set:
+ /usr/sbin/capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
Ambient set =
Current IAB: !cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_net_admin,!cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_mknod,!cap_lease,!cap_audit_write,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1'b0 (no-new-privs=0)
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
secure-no-ambient-raise: no (unlocked)
uid=2000(buildkite-agent) euid=2000(buildkite-agent)
gid=2000(buildkite-agent)
groups=2000(buildkite-agent)
Guessed mode: HYBRID (4)
Anyway, there is a simpler way to make the tests pass if that's what
you are after. We do have
contrib/pg_buffercache/sql/pg_buffercache_numa.sql which is expected
to match outputs in pg_buffercache_numa.out OR (!)
pg_buffercache_numa_1.out. We could probably handle this edge case by
also adding pg_buffercache_numa_2.out (which would just contain the
semi-valid scenario of "ERROR: failed NUMA pages inquiry:
Operation not permitted").
Ah, didn't know that was a possibility. Until this sees more usage than
just querying the state, this may be a nice workaround. If this is more
wide-spread we probably need something a bit more robust for the
detection. I already patch out the tests for our build-env so for me
it's "solved" but that is certainly not a proper solution.
Just FYI, I'll be on PTO so I won't have access to the build-env in the
next two weeks.
Thanks,
Patrick