Changing shared_buffers without restart

Started by Dmitry Dolgov, about 1 year ago · 131 messages
#1 Dmitry Dolgov
9erthalion6@gmail.com
5 attachment(s)

TL;DR: A PoC for changing shared_buffers without a PostgreSQL restart, by
changing the shared memory mapping layout. Any feedback is appreciated.

Hi,

Being able to change PostgreSQL configuration on the fly is an important
property for performance tuning, since it reduces the feedback time and the
invasiveness of the process. In certain cases it even becomes highly desirable,
e.g. when doing automatic tuning. But there are a couple of important
configuration options that cannot be modified without a restart, the most
notorious example being shared_buffers.

I've recently been working on an idea for how to change that, allowing
shared_buffers to be modified without a restart. To demonstrate the approach,
I've prepared a PoC that ignores lots of things, but works in the limited set
of use cases I was testing. I would like to discuss the idea and get some
feedback.

Patches 1-3 prepare the infrastructure and the shared memory layout. They could
be useful even with a multithreaded PostgreSQL, where there would be no need
for shared memory. I assume that in the multithreaded world there will still be
a need for a contiguous chunk of memory to share between threads, and its
layout would be similar to the one used for shared memory mappings.

Patch 4 does the actual resizing. It's shared memory specific, of course, and
utilizes the Linux-specific mremap, which means there are open portability
questions.

Patch 5 is somewhat independent, but quite convenient to have. It also utilizes
the Linux-specific memfd_create call.

The patch set still doesn't address lots of things, e.g. shared memory segment
detach/reattach and portability questions, and it doesn't touch the
EXEC_BACKEND code or huge pages.

So far I've done some rudimentary testing: spinning up PostgreSQL, then
increasing shared_buffers and running pgbench with a scale factor large enough
to extend the data set into the newly allocated buffers:

-- shared_buffers 128 MB
=# SELECT * FROM pg_buffercache_summary();
 buffers_used | buffers_unused | buffers_dirty | buffers_pinned
--------------+----------------+---------------+----------------
          134 |          16250 |             1 |              0

-- change shared_buffers to 512 MB
=# select pg_reload_conf();
=# SELECT * FROM pg_buffercache_summary();
 buffers_used | buffers_unused | buffers_dirty | buffers_pinned
--------------+----------------+---------------+----------------
          221 |          65315 |             1 |              0

-- round of pgbench read-only load
=# SELECT * FROM pg_buffercache_summary();
 buffers_used | buffers_unused | buffers_dirty | buffers_pinned
--------------+----------------+---------------+----------------
        41757 |          23779 |           216 |              0
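
For context, the kind of pgbench run used here is roughly the following (the
exact scale factor and duration are assumptions for illustration, not the
values used in the test above; the only requirement is that the data set
exceeds the new shared_buffers):

    $ pgbench -i -s 100 postgres       # ~1.5 GB data set, well above 512 MB
    $ pgbench -S -c 8 -T 60 postgres   # read-only load pulling data into the new buffers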

Here is the breakdown:

v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch

Preparation; introduces the possibility of working with multiple shmem
mappings. To make it less invasive, I've duplicated the shmem API and extended
it with a shmem_slot argument, while redirecting the original API to it. There
are probably better ways of doing that, I'm open to suggestions.
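
To give an idea of what that looks like, each original entry point now simply
redirects to its slot-aware variant using the main slot, e.g. (essentially
verbatim from the patch):

    void *
    ShmemInitStruct(const char *name, Size size, bool *foundPtr)
    {
        return ShmemInitStructInSlot(name, size, foundPtr, MAIN_SHMEM_SLOT);
    }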

v1-0002-Allow-placing-shared-memory-mapping-with-an-offse.patch

Implements a new layout of shared memory mappings that includes room for
resizing. I've done a couple of tests to verify that the gap in between doesn't
affect how the kernel accounts for actually used memory, to make sure that e.g.
cgroups will not trigger the OOM killer. The only change seems to be in VmPeak,
which counts total mapped pages.
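
The placement itself boils down to the following steps, shown here as a
condensed sketch of the logic in the patch (error handling, huge pages and the
fallback path are omitted):

    /* Let the kernel pick the lowest free address with a throw-away probe. */
    void   *probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
                         PG_MMAP_FLAGS, -1, 0);
    munmap(probe, PROBE_MAPPING_SIZE);

    /* Place the real mapping below the probe, leaving room to grow later. */
    Size offset = last_offset + SHMEM_EXTRA_SIZE_LIMIT[slot] + allocsize;
    last_offset = offset;

    ptr = mmap((char *) probe - offset, allocsize, PROT_READ | PROT_WRITE,
               PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);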

v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch

Splits shared_buffers into multiple slots, moving the structures that depend on
NBuffers out into separate mappings. There are two large gaps here:

* Shmem size calculation for those mappings is not correct yet; it includes too
many other things (no particular issue here, I just haven't had time).
* It makes hardcoded assumptions about the upper limit for resizing, which is
currently low purely for experimentation. Ideally there should be a new
configuration option to specify the total available memory, which would be the
base for subsequent calculations.
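
To give a rough sense of why the buffer blocks need by far the largest reserved
gap, here is a back-of-the-envelope estimate of the NBuffers-dependent
structures (the per-entry sizes are approximations for a typical 64-bit build,
not exact figures):

    shared_buffers = 512 MB, BLCKSZ = 8 kB  =>  NBuffers = 65536

    buffer blocks          : NBuffers * BLCKSZ                    = 512 MB
    buffer descriptors     : NBuffers * ~64 B (BufferDescPadded)  ~   4 MB
    IO condition variables : NBuffers * ~16 B                     ~   1 MB
    checkpoint buffer ids  : NBuffers * ~20 B (CkptSortItem)      ~ 1.3 MB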

v1-0004-Allow-to-resize-shared-memory-without-restart.patch

Changes shared_buffers without a restart. The current approach is clumsy: it
adds an assign hook for shared_buffers and goes from there, using mremap to
resize the mappings. But I haven't immediately found any better approach.
Currently it supports only increasing shared_buffers.
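
In essence the hook walks the mappings and grows each one in place, roughly as
follows (heavily condensed from the patch; only the growth path is shown, and
afterwards ResizeBufferPool reinitializes the NBuffers-dependent structures):

    for (int i = 0; i < next_free_slot; i++)
    {
        AnonymousMapping *m = &Mappings[i];
        Size new_size = CalculateShmemSize(&numSemas, i);

        if (m->shmem == NULL || m->shmem_size == new_size)
            continue;

        /* Grow in place; fails if the reserved gap is not large enough. */
        if (mremap(m->shmem, m->shmem_size, new_size, 0) < 0)
            elog(LOG, "mremap(%p, %zu) failed: %m", m->shmem, m->shmem_size);
        else
            m->shmem_size = new_size;
    }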

v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch

Allows an anonymous file to back a shared mapping. This makes certain things
easier, e.g. the visual representation of the mappings, and gives us an fd for
possible future customizations.
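
For reference, the general Linux pattern for backing such a mapping looks
roughly like this (a generic sketch rather than the exact code from the patch;
memfd_create is available since Linux 3.17):

    int    fd   = memfd_create("main", MFD_CLOEXEC);  /* name appears in /proc/$PID/maps */
    size_t size = 128 * 1024 * 1024;

    ftruncate(fd, size);    /* size the anonymous file */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);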

In this thread I'm hoping to answer the following questions:

* Are there any concerns about this approach?
* What would be a better mechanism to handle resizing than an assign hook?
* Assuming I'll be able to address the already known missing bits, what are the
chances that the patch series could be accepted?

Attachments:

v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch (text/plain; charset=us-ascii)
From 954613a63cb1102d7eb88f92e7ff561828bbb5c9 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 9 Oct 2024 15:41:32 +0200
Subject: [PATCH v1 1/5] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways in which shared memory can be organized.

Introduce the possibility of allocating multiple shared memory mappings, where
a single mapping is associated with a specified shared memory slot. There is
only a fixed number of available slots; currently only one main shared memory
slot is allocated. A new shared memory API is introduced, extended with a slot
as a new parameter. As a path of least resistance, the original API is kept in
place, utilizing the main shared memory slot.
---
 src/backend/port/posix_sema.c       |   4 +-
 src/backend/port/sysv_sema.c        |   4 +-
 src/backend/port/sysv_shmem.c       | 138 +++++++++++++++++++---------
 src/backend/port/win32_sema.c       |   2 +-
 src/backend/storage/ipc/ipc.c       |   2 +-
 src/backend/storage/ipc/ipci.c      |  61 ++++++------
 src/backend/storage/ipc/shmem.c     | 133 ++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c   |   5 +-
 src/include/storage/buf_internals.h |   1 +
 src/include/storage/ipc.h           |   2 +-
 src/include/storage/pg_sema.h       |   2 +-
 src/include/storage/pg_shmem.h      |  18 ++++
 src/include/storage/shmem.h         |  10 ++
 13 files changed, 258 insertions(+), 124 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 64186ec0a7..b97723d2ed 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_slot)
 {
 	struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSlot(PGSemaphoreShmemSize(maxSemas), shmem_slot);
 #endif
 
 	numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 5b88a92bc9..8ef95b12c9 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -307,7 +307,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_slot)
 {
 	struct stat statbuf;
 
@@ -328,7 +328,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSlot(PGSemaphoreShmemSize(maxSemas), shmem_slot);
 	numSharedSemas = 0;
 	maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 362a37d3b3..065a5b63ac 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+	int shmem_slot;
+	Size shmem_size; 			/* Size of the mapping */
+	void *shmem; 				/* Pointer to the start of the mapped memory */
+	void *seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping slots */
+static int next_free_slot = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_slot)
+{
+	switch (shmem_slot)
+	{
+		case MAIN_SHMEM_SLOT:
+			return "main";
+		default:
+			return "unknown";
+	}
+}
+
+static void
+DebugMappings()
+{
+	for(int i = 0; i < next_free_slot; i++)
+	{
+		AnonymousMapping m = Mappings[i];
+		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+			 MappingName(i), m.shmem, m.shmem_size);
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-	Size		allocsize = *size;
+	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
 				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		{
+			DebugMappings();
+			elog(DEBUG1, "slot[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
+				 MappingName(mapping->shmem_slot), allocsize);
+		}
 	}
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = *size;
+		allocsize = mapping->shmem_size;
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
 	if (ptr == MAP_FAILED)
 	{
 		errno = mmap_errno;
+		DebugMappings();
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
+				(errmsg("slot[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_slot)),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
-	*size = allocsize;
-	return ptr;
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < next_free_slot; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		AnonymousMapping m = Mappings[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	AnonymousMapping *mapping = &Mappings[next_free_slot];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	mapping->shmem_size = size;
+	mapping->shmem_slot = next_free_slot;
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping);
+
+		next_free_slot++;
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + next_free_slot;
 
 	for (;;)
 	{
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	mapping->seg_addr = memAddress;
+	mapping->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (mapping->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < next_free_slot; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		AnonymousMapping m = Mappings[i];
+
+		if (m.seg_addr != NULL)
+		{
+			if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+			m.seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index f2b54bdfda..d62084cc0d 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_slot)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index b06e4b8452..2aabd4a77f 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -68,7 +68,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 35fa2e1dda..8224015b53 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -88,7 +88,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_slot)
 {
 	Size		size;
 	int			numSemas;
@@ -202,33 +202,36 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize(&numSemas);
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
-
-	/*
-	 * Create semaphores
-	 */
-	PGReserveSemaphores(numSemas);
-
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int slot = 0; slot < ANON_MAPPINGS; slot++)
+	{
+		/* Compute the size of the shared-memory block */
+		size = CalculateShmemSize(&numSemas, slot);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+		/*
+		 * Create the shmem segment
+		 */
+		seghdr = PGSharedMemoryCreate(size, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSlot(seghdr, slot);
+
+		/*
+		 * Create semaphores
+		 */
+		PGReserveSemaphores(numSemas, slot);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSlot(slot);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -359,7 +362,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas);
+	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SLOT);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 6d5f083986..c670b9cf43 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -75,17 +75,12 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSlot(Size size, Size *allocated_size,
+								 int shmem_slot);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
-
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
+ShmemSegment Segments[ANON_MAPPINGS];
 
 static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 
@@ -99,11 +94,17 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 void
 InitShmemAccess(void *seghdr)
 {
-	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	InitShmemAccessInSlot(seghdr, MAIN_SHMEM_SLOT);
+}
 
-	ShmemSegHdr = shmhdr;
-	ShmemBase = (void *) shmhdr;
-	ShmemEnd = (char *) ShmemBase + shmhdr->totalsize;
+void
+InitShmemAccessInSlot(void *seghdr, int shmem_slot)
+{
+	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	ShmemSegment *seg = &Segments[shmem_slot];
+	seg->ShmemSegHdr = shmhdr;
+	seg->ShmemBase = (void *) shmhdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -114,7 +115,13 @@ InitShmemAccess(void *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSlot(MAIN_SHMEM_SLOT);
+}
+
+void
+InitShmemAllocationInSlot(int shmem_slot)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_slot].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -123,9 +130,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_slot].ShmemLock = (slock_t *) ShmemAllocUnlockedInSlot(sizeof(slock_t), shmem_slot);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_slot].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -150,11 +157,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSlot(size, MAIN_SHMEM_SLOT);
+}
+
+void *
+ShmemAllocInSlot(Size size, int shmem_slot)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSlot(size, &allocated_size, shmem_slot);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -184,6 +197,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSlot(size, allocated_size, MAIN_SHMEM_SLOT);
+}
+
+static void *
+ShmemAllocRawInSlot(Size size, Size *allocated_size, int shmem_slot)
 {
 	Size		newStart;
 	Size		newFree;
@@ -203,22 +222,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_slot].ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[shmem_slot].ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_slot].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= Segments[shmem_slot].ShmemSegHdr->totalsize)
 	{
-		newSpace = (void *) ((char *) ShmemBase + newStart);
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (void *) ((char *) Segments[shmem_slot].ShmemBase + newStart);
+		Segments[shmem_slot].ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[shmem_slot].ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -236,6 +255,12 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+	return ShmemAllocUnlockedInSlot(size, MAIN_SHMEM_SLOT);
+}
+
+void *
+ShmemAllocUnlockedInSlot(Size size, int shmem_slot)
 {
 	Size		newStart;
 	Size		newFree;
@@ -246,19 +271,19 @@ ShmemAllocUnlocked(Size size)
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_slot].ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_slot].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > Segments[shmem_slot].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory (%zu bytes requested)",
 						size)));
-	ShmemSegHdr->freeoffset = newFree;
+	Segments[shmem_slot].ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (void *) ((char *) ShmemBase + newStart);
+	newSpace = (void *) ((char *) Segments[shmem_slot].ShmemBase + newStart);
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -273,7 +298,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSlot(addr, MAIN_SHMEM_SLOT);
+}
+
+bool
+ShmemAddrIsValidInSlot(const void *addr, int shmem_slot)
+{
+	return (addr >= Segments[shmem_slot].ShmemBase) && (addr < Segments[shmem_slot].ShmemEnd);
 }
 
 /*
@@ -334,6 +365,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSlot(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SLOT);
+}
+
+HTAB *
+ShmemInitHashInSlot(const char *name,		/* table string name for shmem index */
+			  long init_size,	/* initial table size */
+			  long max_size,	/* max size of the table */
+			  HASHCTL *infoP,	/* info about key and bucket size */
+			  int hash_flags,	/* info about infoP */
+			  int shmem_slot) 	/* in which slot to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -350,9 +393,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSlot(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_slot);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -385,6 +428,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSlot(name, size, foundPtr, MAIN_SHMEM_SLOT);
+}
+
+void *
+ShmemInitStructInSlot(const char *name, Size size, bool *foundPtr,
+					  int shmem_slot)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -393,7 +443,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_slot].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -416,7 +466,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSlot(size, shmem_slot);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -433,8 +483,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in slot %d",
+						name, shmem_slot)));
 	}
 
 	if (*foundPtr)
@@ -459,7 +509,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSlot(size, &allocated_size, shmem_slot);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -478,14 +528,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSlot(structPtr, shmem_slot));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -545,7 +594,7 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SLOT].ShmemSegHdr);
 		values[2] = Int64GetDatum(ent->size);
 		values[3] = Int64GetDatum(ent->allocated_size);
 		named_allocated += ent->allocated_size;
@@ -557,15 +606,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	/* output shared memory allocated but not counted via the shmem index */
 	values[0] = CStringGetTextDatum("<anonymous>");
 	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->freeoffset - named_allocated);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 	/* output as-of-yet unused shared memory */
 	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->freeoffset);
 	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->freeoffset);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e765754d80..fb0c33bf17 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -81,6 +81,7 @@
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/spin.h"
@@ -607,9 +608,9 @@ LWLockNewTrancheId(void)
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
 	/* We use the ShmemLock spinlock to protect LWLockCounter */
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[MAIN_SHMEM_SLOT].ShmemLock);
 	result = (*LWLockCounter)++;
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SLOT].ShmemLock);
 
 	return result;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f190e6e5e4..aef80e049b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -23,6 +23,7 @@
 #include "storage/latch.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
+#include "storage/pg_shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index b2d062781e..be4b131288 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_slot);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index dfef79ac96..081fffaf16 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_slot);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 3065ff5be7..e968deeef7 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+} ShmemSegment;
+
+// Number of available slots for anonymous memory mappings
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -90,4 +105,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main slot, contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SLOT 0
+
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 842989111c..d3e9cc721d 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -28,15 +28,25 @@
 /* shmem.c */
 extern PGDLLIMPORT slock_t *ShmemLock;
 extern void InitShmemAccess(void *seghdr);
+extern void InitShmemAccessInSlot(void *seghdr, int shmem_slot);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSlot(int shmem_slot);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSlot(Size size, int shmem_slot);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSlot(Size size, int shmem_slot);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSlot(const void *addr, int shmem_slot);
 extern void InitShmemIndex(void);
+extern void InitVariableShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSlot(const char *name, long init_size, long max_size,
+						   HASHCTL *infoP, int hash_flags, int shmem_slot);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSlot(const char *name, Size size, bool *foundPtr,
+								   int shmem_slot);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 

base-commit: 2488058dc356a43455b21a099ea879fff9266634
-- 
2.45.1

v1-0002-Allow-placing-shared-memory-mapping-with-an-offse.patch (text/plain; charset=us-ascii)
From e9980f76cbd1ea6f6d732e2a27dd1342258d26e5 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:21:33 +0200
Subject: [PATCH v1 2/5] Allow placing shared memory mapping with an offset

Currently the kernel is responsible for choosing an address at which to place
each shared memory mapping, which is the lowest possible address that does not
clash with any other mapping. This is considered to be the most portable
approach, but one of the downsides is that there is no room left to resize
allocated mappings. Here is how it looks for one mapping in /proc/$PID/maps,
where /dev/zero represents the anonymous shared memory in question:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
    ...
    7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
    7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)

By specifying the mapping address directly it's possible to place the
mapping in a way that leaves room for resizing. The idea is to first get
the address chosen by the kernel, then apply an offset derived from the
expected upper limit. Because we base the layout on the address chosen by
the kernel, things like address space randomization should not be a
problem, since the randomization is applied to the mmap base, which is one
per process. The result looks like this:

    012d9000-0133e000         [heap]
    7f443a800000-7f444196c000 /dev/zero (deleted)
    [...free space...]
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2

This approach does not impact the actual memory usage as reported by the
kernel. Here is the output of /proc/$PID/status for the master version with
shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages mapped in mm_struct
    VmPeak:           422780 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             21248 kB
    // Size of resident anonymous memory
    RssAnon:             640 kB
    // Size of resident file mappings
    RssFile:            9728 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          10880 kB

Here is the same for the patch with the shared mapping placed at an offset of
10 GB:

    VmPeak:          1102844 kB
    VmRSS:             21376 kB
    RssAnon:             640 kB
    RssFile:            9856 kB
    RssShmem:          10880 kB

Cgroup v2 doesn't have any problems with that either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    18219008 (~17 MB)

Note that currently the implementation makes assumptions about the upper limit.
Ideally it should be based on the maximum available memory.
---
 src/backend/port/sysv_shmem.c | 120 +++++++++++++++++++++++++++++++++-
 1 file changed, 119 insertions(+), 1 deletion(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 065a5b63ac..7e6c8bb78d 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -108,6 +108,63 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping slots */
 static int next_free_slot = 0;
 
+/*
+ * Anonymous mapping placing (/dev/zero (deleted) below) looks like this:
+ *
+ * 00400000-00490000         /path/bin/postgres
+ * ...
+ * 012d9000-0133e000         [heap]
+ * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ * 7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
+ * 7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)
+ * ...
+ *
+ * We would like to place multiple mappings in such a way, that there will be
+ * enough space between them in the address space to be able to resize up to
+ * certain size, but without counting towards the total memory consumption.
+ *
+ * By letting Linux to chose a mapping address, it will pick up the lowest
+ * possible address that do not clash with any other mappings, which will be
+ * right before locales in the example above. This information (maximum allowed
+ * size of mappings and the lowest mapping address) is enough to place every
+ * mapping as follow:
+ *
+ * - Take the lowest mapping address, which we call later the probe address.
+ * - Substract the offset of the previous mapping.
+ * - Substract the maximum allowed size for the current mapping from the
+ *   address.
+ * - Place the mapping by the resulting address.
+ *
+ * The result would look like this:
+ *
+ * 012d9000-0133e000         [heap]
+ * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * [...free space...]
+ * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * [...free space...]
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ */
+Size SHMEM_EXTRA_SIZE_LIMIT[1] = {
+	0, 									/* MAIN_SHMEM_SLOT */
+};
+
+/* Remembers offset of the last mapping from the probe address */
+static Size last_offset = 0;
+
+/*
+ * Size of the mapping, which will be used to calculate anonymous mapping
+ * address. It should not be too small, otherwise there is a chance the probe
+ * mapping will be created between other mappings, leaving no room extending
+ * it. But it should not be too large either, in case if there are limitations
+ * on the mapping size. Current value is the default shared_buffers.
+ */
+#define PROBE_MAPPING_SIZE (Size) 128 * 1024 * 1024
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -673,13 +730,74 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
 	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
 	{
+		void *probe = NULL;
+
 		/*
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
 		allocsize = mapping->shmem_size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+
+		/*
+		 * Try to create mapping at an address, which will allow to extend it
+		 * later:
+		 *
+		 * - First create the temporary probe mapping of a fixed size and let
+		 *   kernel to place it at address of its choice. By the virtue of the
+		 *   probe mapping size we expect it to be located at the lowest
+		 *   possible address, expecting some non mapped space above.
+		 *
+		 * - Unmap the probe mapping, remember the address.
+		 *
+		 * - Create an actual anonymous mapping at that address with the
+		 *   offset. The offset is calculated in such a way to allow growing
+		 *   the mapping withing certain boundaries. For this mapping we use
+		 *   MAP_FIXED_NOREPLACE, which will error out with EEXIST if there is
+		 *   any mapping clash.
+		 *
+		 * - If the last step has failed, fallback to the regular mapping
+		 *   creation and signal that shared buffers could not be resized
+		 *   without a restart.
+		 */
+		probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
+
+		if (probe == MAP_FAILED)
+		{
+			mmap_errno = errno;
+			DebugMappings();
+			elog(DEBUG1, "slot[%s]: probe mmap(%zu) failed: %m",
+					MappingName(mapping->shmem_slot), allocsize);
+		}
+		else
+		{
+			Size offset = last_offset + SHMEM_EXTRA_SIZE_LIMIT[next_free_slot] + allocsize;
+			last_offset = offset;
+
+			munmap(probe, PROBE_MAPPING_SIZE);
+
+			ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE,
+					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+			mmap_errno = errno;
+			if (ptr == MAP_FAILED)
+			{
+				DebugMappings();
+				elog(DEBUG1, "slot[%s]: mmap(%zu) at address %p failed: %m",
+					 MappingName(mapping->shmem_slot), allocsize, probe - offset);
+			}
+
+		}
+	}
+
+	if (ptr == MAP_FAILED)
+	{
+		/*
+		 * Fallback to the portable way of creating a mapping.
+		 */
+		allocsize = mapping->shmem_size;
+
+		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+						   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
 	}
 
-- 
2.45.1

v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch (text/plain; charset=us-ascii)
From 62ae567f1c7a56c32722508b60251f9cec245ea3 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:24:04 +0200
Subject: [PATCH v1 3/5] Introduce multiple shmem slots for shared buffers

Add more shmem slots to split shared buffers into following chunks:
* BUFFERS_SHMEM_SLOT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SLOT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SLOT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SLOT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SLOT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers, meaning
that if we would like to change NBuffers, they have to be resized
correspondingly. Placing each of them in a separate shmem slot makes that
possible.

There are some assumptions made about the upper size limit of each shmem slot.
The buffer blocks have the largest one, while the rest claim less extra room
for resizing. Ideally those limits should be deduced from the maximum allowed
shared memory.
---
 src/backend/port/sysv_shmem.c          | 17 +++++-
 src/backend/storage/buffer/buf_init.c  | 79 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  5 +-
 src/backend/storage/buffer/freelist.c  |  4 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 23 +++++++-
 7 files changed, 97 insertions(+), 35 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 7e6c8bb78d..beebd4d85e 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -149,8 +149,13 @@ static int next_free_slot = 0;
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
  */
-Size SHMEM_EXTRA_SIZE_LIMIT[1] = {
+Size SHMEM_EXTRA_SIZE_LIMIT[6] = {
 	0, 									/* MAIN_SHMEM_SLOT */
+	(Size) 1024 * 1024 * 1024 * 10, 	/* BUFFERS_SHMEM_SLOT */
+	(Size) 1024 * 1024 * 1024 * 1, 		/* BUFFER_DESCRIPTORS_SHMEM_SLOT */
+	(Size) 1024 * 1024 * 100, 			/* BUFFER_IOCV_SHMEM_SLOT */
+	(Size) 1024 * 1024 * 100, 			/* CHECKPOINT_BUFFERS_SHMEM_SLOT */
+	(Size) 1024 * 1024 * 100, 			/* STRATEGY_SHMEM_SLOT */
 };
 
 /* Remembers offset of the last mapping from the probe address */
@@ -179,6 +184,16 @@ MappingName(int shmem_slot)
 	{
 		case MAIN_SHMEM_SLOT:
 			return "main";
+		case BUFFERS_SHMEM_SLOT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SLOT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SLOT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SLOT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SLOT:
+			return "strategy";
 		default:
 			return "unknown";
 	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 46116a1f64..6bca286bef 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -62,7 +62,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). Size of data structures initialized
+ * here depends on NBuffers, and to be able to change NBuffers without a
+ * restart we store each structure into a separate shared memory slot, which
+ * could be resized on demand.
  */
 void
 InitBufferPool(void)
@@ -74,22 +77,22 @@ InitBufferPool(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSlot("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SLOT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSlot("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SLOT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSlot("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SLOT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -99,8 +102,9 @@ InitBufferPool(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSlot("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SLOT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -154,33 +158,54 @@ InitBufferPool(void)
  * BufferShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc. based on the
+ * shared memory slot. The main slot must not allocate anything
+ * related to buffers, every other slot will receive part of the
+ * data.
  */
 Size
-BufferShmemSize(void)
+BufferShmemSize(int shmem_slot)
 {
 	Size		size = 0;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_slot == MAIN_SHMEM_SLOT)
+		return size;
 
-	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	if (shmem_slot == BUFFER_DESCRIPTORS_SHMEM_SLOT)
+	{
+		/* size of buffer descriptors */
+		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+		/* to allow aligning buffer descriptors */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	if (shmem_slot == BUFFERS_SHMEM_SLOT)
+	{
+		/* size of data pages, plus alignment padding */
+		size = add_size(size, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	}
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
-								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_slot == STRATEGY_SHMEM_SLOT)
+	{
+		/* size of stuff controlled by freelist.c */
+		size = add_size(size, StrategyShmemSize());
+	}
 
-	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	if (shmem_slot == BUFFER_IOCV_SHMEM_SLOT)
+	{
+		/* size of I/O condition variables */
+		size = add_size(size, mul_size(NBuffers,
+									   sizeof(ConditionVariableMinimallyPadded)));
+		/* to allow aligning the above */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
+
+	if (shmem_slot == CHECKPOINT_BUFFERS_SHMEM_SLOT)
+	{
+		/* size of checkpoint sort array in bufmgr.c */
+		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	}
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 0fa5468930..ccbaed8010 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -59,10 +59,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSlot("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION,
+								  STRATEGY_SHMEM_SLOT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 19797de31a..8ce1611db2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -491,9 +491,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSlot("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SLOT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8224015b53..fbaddba396 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -115,7 +115,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_slot)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferShmemSize());
+	size = add_size(size, BufferShmemSize(shmem_slot));
 	size = add_size(size, LockShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c8422571b7..4c09d270c9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -301,7 +301,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 
 /* in buf_init.c */
 extern void InitBufferPool(void);
-extern Size BufferShmemSize(void);
+extern Size BufferShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index e968deeef7..c0143e3899 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 // Number of available slots for anonymous memory mappings
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -105,7 +105,28 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/*
+ * To be able to dynamically resize largest parts of the data stored in shared
+ * memory, we split it into multiple shared memory mappings slots. Each slot
+ * contains only certain part of the data, which size depends on NBuffers.
+ */
+
 /* The main slot, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SLOT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SLOT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SLOT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SLOT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SLOT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SLOT 5
+
 #endif							/* PG_SHMEM_H */
-- 
2.45.1

v1-0004-Allow-to-resize-shared-memory-without-restart.patch (text/plain; charset=us-ascii)
From 7183999bba1cbeebd059d18e5a590cbef7aff2d1 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:24:58 +0200
Subject: [PATCH v1 4/5] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using the space
introduced in the previous commits, without requiring a PostgreSQL restart.
The size of every shared memory slot is recalculated based on the new
NBuffers and extended using mremap. After allocating new space, new
shared structures (buffer blocks, descriptors, etc.) are allocated as
needed. Here is how it looks after raising shared_buffers from 128
MB to 512 MB and calling pg_reload_conf():

    -- 128 MB
    7f5a2bd04000-7f5a32e52000  /dev/zero (deleted)
    7f5a39252000-7f5a4030e000  /dev/zero (deleted)
    7f5a4670e000-7f5a4d7ba000  /dev/zero (deleted)
    7f5a53bba000-7f5a5ad26000  /dev/zero (deleted)
    7f5a9ad26000-7f5aa9d94000  /dev/zero (deleted)
    ^ buffers mapping, ~240 MB
    7f5d29d94000-7f5d30e00000  /dev/zero (deleted)

    -- 512 MB
    7f5a2bd04000-7f5a33274000  /dev/zero (deleted)
    7f5a39252000-7f5a4057e000  /dev/zero (deleted)
    7f5a4670e000-7f5a4d9fa000  /dev/zero (deleted)
    7f5a53bba000-7f5a5b1a6000  /dev/zero (deleted)
    7f5a9ad26000-7f5ac1f14000  /dev/zero (deleted)
    ^ buffers mapping, ~625 MB
    7f5d29d94000-7f5d30f80000  /dev/zero (deleted)

The implementation supports only increasing shared_buffers. Decreasing the
value would require a similar procedure, but the buffer blocks with data would
have to be drained first, so that the actual data set fits into the new,
smaller space.

From experiments it turns out that shared mappings have to be extended
separately for each process that uses them. Another rough edge is that a
backend executing pg_reload_conf interactively will not resize the mappings
immediately; for some reason it requires another command.

Note that mremap is Linux-specific, so the implementation is not very
portable.
---
 src/backend/port/sysv_shmem.c                 | 62 +++++++++++++
 src/backend/storage/buffer/buf_init.c         | 86 +++++++++++++++++++
 src/backend/storage/ipc/ipci.c                | 11 +++
 src/backend/storage/ipc/shmem.c               | 14 ++-
 .../utils/activity/wait_event_names.txt       |  1 +
 src/backend/utils/misc/guc_tables.c           |  4 +-
 src/include/storage/bufmgr.h                  |  1 +
 src/include/storage/lwlocklist.h              |  1 +
 src/include/storage/pg_shmem.h                |  2 +
 9 files changed, 171 insertions(+), 11 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index beebd4d85e..4bdadbb0e2 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,9 +30,11 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
@@ -859,6 +861,66 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * An assign callback for shared_buffers GUC -- a somewhat clumsy way of
+ * resizing shared memory without a restart. On NBuffers change, use the new
+ * value to recalculate the required size for every shmem slot, then, based on
+ * the new and old values, initialize the new buffer blocks.
+ *
+ * The actual slot resizing is done via mremap, which will fail if there is not
+ * sufficient space to expand the mapping.
+ *
+ * XXX: For some reason in the current implementation the change is applied to
+ * the backend calling pg_reload_conf only at backend exit.
+ */
+void
+AnonymousShmemResize(int newval, void *extra)
+{
+	int	numSemas;
+	bool reinit = false;
+	int NBuffersOld = NBuffers;
+
+	/*
+	 * XXX: Currently only increasing of shared_buffers is supported. For
+	 * decreasing something similar has to be done, but buffer blocks with
+	 * data have to be drained first.
+	 */
+	if(NBuffers > newval)
+		return;
+
+	/* XXX: Hack, NBuffers has to be exposed in the interface for
+	 * memory calculation and buffer blocks reinitialization instead. */
+	NBuffers = newval;
+
+	for(int i = 0; i < next_free_slot; i++)
+	{
+		Size new_size = CalculateShmemSize(&numSemas, i);
+		AnonymousMapping *m = &Mappings[i];
+
+		if (m->shmem == NULL)
+			continue;
+
+		if (m->shmem_size == new_size)
+			continue;
+
+		if (mremap(m->shmem, m->shmem_size, new_size, 0) < 0)
+			elog(LOG, "mremap(%p, %zu) failed: %m",
+				 m->shmem, m->shmem_size);
+		else
+		{
+			reinit = true;
+			m->shmem_size = new_size;
+		}
+	}
+
+	if (reinit)
+	{
+		LWLockAcquire(ShmemResizeLock, LW_EXCLUSIVE);
+		ResizeBufferPool(NBuffersOld);
+		LWLockRelease(ShmemResizeLock);
+	}
+}
+
 /*
  * PGSharedMemoryCreate
  *
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6bca286bef..4054abf0e8 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -154,6 +154,92 @@ InitBufferPool(void)
 						 &backend_flush_after);
 }
 
+/*
+ * Reinitialize shared memory structures whose size depends on NBuffers. It's
+ * similar to InitBufferPool, but applied only to the buffers in the range
+ * between NBuffersOld and NBuffers.
+ */
+void
+ResizeBufferPool(int NBuffersOld)
+{
+	bool		foundBufs,
+				foundDescs,
+				foundIOCV,
+				foundBufCkpt;
+	int			i;
+
+	/* XXX: Only increasing of shared_buffers is supported in this function */
+	if(NBuffersOld > NBuffers)
+		return;
+
+	/* Align descriptors to a cacheline boundary. */
+	BufferDescriptors = (BufferDescPadded *)
+		ShmemInitStructInSlot("Buffer Descriptors",
+						NBuffers * sizeof(BufferDescPadded),
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SLOT);
+
+	/* Align condition variables to cacheline boundary. */
+	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
+		ShmemInitStructInSlot("Buffer IO Condition Variables",
+						NBuffers * sizeof(ConditionVariableMinimallyPadded),
+						&foundIOCV, BUFFER_IOCV_SHMEM_SLOT);
+
+	/*
+	 * The array used to sort to-be-checkpointed buffer ids is located in
+	 * shared memory, to avoid having to allocate significant amounts of
+	 * memory at runtime. As that'd be in the middle of a checkpoint, or when
+	 * the checkpointer is restarted, memory allocation failures would be
+	 * painful.
+	 */
+	CkptBufferIds = (CkptSortItem *)
+		ShmemInitStructInSlot("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SLOT);
+
+	/* Align buffer pool on IO page size boundary. */
+	BufferBlocks = (char *)
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemInitStructInSlot("Buffer Blocks",
+								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &foundBufs, BUFFERS_SHMEM_SLOT));
+
+	/*
+	 * Initialize the headers for new buffers.
+	 */
+	for (i = NBuffersOld - 1; i < NBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
+
+		ClearBufferTag(&buf->tag);
+
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+
+		buf->buf_id = i;
+
+		/*
+		 * Initially link all the buffers together as unused. Subsequent
+		 * management of this list is done by freelist.c.
+		 */
+		buf->freeNext = i + 1;
+
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
+
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
+	}
+
+	/* Correct last entry of linked list */
+	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+
+	/* Init other shared buffer-management stuff */
+	StrategyInitialize(!foundDescs);
+
+	/* Initialize per-backend file flush context */
+	WritebackContextInit(&BackendWritebackContext,
+						 &backend_flush_after);
+}
+
 /*
  * BufferShmemSize
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index fbaddba396..56fa339f55 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -86,6 +86,9 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ *
+ * XXX: Calculations for non-main shared memory slots are incorrect; they
+ * include more than is needed for buffers only.
  */
 Size
 CalculateShmemSize(int *num_semaphores, int shmem_slot)
@@ -153,6 +156,14 @@ CalculateShmemSize(int *num_semaphores, int shmem_slot)
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, WaitLSNShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how slots are split, or was there a hidden
+	 * inconsistency in the shmem calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c670b9cf43..20c4b1d5ad 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -491,17 +491,13 @@ ShmemInitStructInSlot(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. Verify the structure's size:
+		 * - If it's the same, we've found the expected structure.
+		 * - If it's different, we're resizing the expected structure.
 		 */
 		if (result->size != size)
-		{
-			LWLockRelease(ShmemIndexLock);
-			ereport(ERROR,
-					(errmsg("ShmemIndex entry size is wrong for data structure"
-							" \"%s\": expected %zu, actual %zu",
-							name, size, result->size)));
-		}
+			result->size = size;
+
 		structPtr = result->location;
 	}
 	else
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d10ca723dc..42296d950e 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -347,6 +347,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 636780673b..7f2c45b7f9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,14 +2301,14 @@ struct config_int ConfigureNamesInt[] =
 	 * checking for overflow, so we mustn't allow more than INT_MAX / 2.
 	 */
 	{
-		{"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+		{"shared_buffers", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the number of shared memory buffers used by the server."),
 			NULL,
 			GUC_UNIT_BLOCKS
 		},
 		&NBuffers,
 		16384, 16, INT_MAX / 2,
-		NULL, NULL, NULL
+		NULL, AnonymousShmemResize, NULL
 	},
 
 	{
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4c09d270c9..ff75c46307 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -302,6 +302,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(int);
+extern void ResizeBufferPool(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 88dc79b2bd..fb310e8b9d 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, WaitLSN)
+PG_LWLOCK(54, ShmemResize)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index c0143e3899..ff4736c6c8 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -105,6 +105,8 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+void AnonymousShmemResize(int newval, void *extra);
+
 /*
 * To be able to dynamically resize the largest parts of the data stored in shared
 * memory, we split it into multiple shared memory mapping slots. Each slot
-- 
2.45.1

v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch (text/plain; charset=us-ascii)
From 6df85a35e8f6cca94a963d516f1b6974850ba05b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 15 Oct 2024 16:18:45 +0200
Subject: [PATCH v1 5/5] Use anonymous files to back shared memory segments

Allow the use of anonymous files for shared memory, instead of plain
anonymous memory. Such an anonymous file is created via memfd_create; it
lives in memory, behaves like a regular file, and is semantically equivalent
to anonymous memory allocated via mmap with MAP_ANONYMOUS.

The advantages of using anon files are the following:

* We've got a file descriptor, which could be used for regular file
  operations (modification, truncation, you name it).

* The file could be given a name, which improves readability when it
  comes to process maps. Here is how it looks:

7f5a2bd04000-7f5a32e52000 rw-s 00000000 00:01 1845 /memfd:strategy (deleted)
7f5a39252000-7f5a4030e000 rw-s 00000000 00:01 1842 /memfd:checkpoint (deleted)
7f5a4670e000-7f5a4d7ba000 rw-s 00000000 00:01 1839 /memfd:iocv (deleted)
7f5a53bba000-7f5a5ad26000 rw-s 00000000 00:01 1836 /memfd:descriptors (deleted)
7f5a9ad26000-7f5aa9d94000 rw-s 00000000 00:01 1833 /memfd:buffers (deleted)
7f5d29d94000-7f5d30e00000 rw-s 00000000 00:01 1830 /memfd:main (deleted)

* By default, Linux will not add file-backed shared mappings into a core dump,
  making it more convenient to work with them in PostgreSQL: no more huge dumps
  to process.

The downside is that memfd_create is Linux specific.
---
 src/backend/port/sysv_shmem.c | 47 +++++++++++++++++++++++++++++------
 src/include/portability/mem.h |  2 +-
 2 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 4bdadbb0e2..a01c3e4789 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -103,6 +103,7 @@ typedef struct AnonymousMapping
 	void *shmem; 				/* Pointer to the start of the mapped memory */
 	void *seg_addr; 			/* SysV shared memory for the header */
 	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -116,7 +117,7 @@ static int next_free_slot = 0;
  * 00400000-00490000         /path/bin/postgres
  * ...
  * 012d9000-0133e000         [heap]
- * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f443a800000-7f470a800000 /memfd:main (deleted)
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
@@ -143,9 +144,9 @@ static int next_free_slot = 0;
  * The result would look like this:
  *
  * 012d9000-0133e000         [heap]
- * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f4426f54000-7f442e010000 /memfd:main (deleted)
  * [...free space...]
- * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f443a800000-7f444196c000 /memfd:buffers (deleted)
  * [...free space...]
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
@@ -708,6 +709,18 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
+	/*
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped, it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, thus it has
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
+	 */
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_slot), 0);
+
 #ifndef MAP_HUGETLB
 	/* PGSharedMemoryCreate should have dealt with this case */
 	Assert(huge_pages != HUGE_PAGES_ON);
@@ -725,8 +738,13 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
+		/*
+		 * Do not use an anonymous file here yet. When adding it, do not forget
+		 * to use ftruncate and flags MFD_HUGETLB & MFD_HUGE_2MB/MFD_HUGE_1GB
+		 * in memfd_create.
+		 */
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
 		{
@@ -762,7 +780,8 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 * - First create the temporary probe mapping of a fixed size and let
 		 *   kernel to place it at address of its choice. By the virtue of the
 		 *   probe mapping size we expect it to be located at the lowest
-		 *   possible address, expecting some non mapped space above.
+		 *   possible address, expecting some non mapped space above. The probe
+		 *   does not need to be backed by an anonymous file.
 		 *
 		 * - Unmap the probe mapping, remember the address.
 		 *
@@ -777,7 +796,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 *   without a restart.
 		 */
 		probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS, -1, 0);
 
 		if (probe == MAP_FAILED)
 		{
@@ -793,8 +812,14 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
 			munmap(probe, PROBE_MAPPING_SIZE);
 
+			/*
+			 * Specify the segment file size using allocsize, which contains
+			 * potentially modified size.
+			 */
+			ftruncate(mapping->segment_fd, allocsize);
+
 			ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE,
-					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, mapping->segment_fd, 0);
 			mmap_errno = errno;
 			if (ptr == MAP_FAILED)
 			{
@@ -813,8 +838,11 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 */
 		allocsize = mapping->shmem_size;
 
+		/* Specify the segment file size using allocsize. */
+		ftruncate(mapping->segment_fd, allocsize);
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-						   PG_MMAP_FLAGS, -1, 0);
+						   PG_MMAP_FLAGS, mapping->segment_fd, 0);
 		mmap_errno = errno;
 	}
 
@@ -903,6 +931,9 @@ AnonymousShmemResize(int newval, void *extra)
 		if (m->shmem_size == new_size)
 			continue;
 
+		/* Resize the backing anon file. */
+		ftruncate(m->segment_fd, new_size);
+
 		if (mremap(m->shmem, m->shmem_size, new_size, 0) < 0)
 			elog(LOG, "mremap(%p, %zu) failed: %m",
 				 m->shmem, m->shmem_size);
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index 2cd05313b8..50db0da28d 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
-- 
2.45.1

#2Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#1)
Re: Changing shared_buffers without restart

On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:

TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.

It was pointed out to me that earlier this year there was a useful
discussion about similar matters, "PGC_SIGHUP shared_buffers?" [1]. From
what I see, the patch series falls into the "re-map" category in that
thread.

[1]: /messages/by-id/CA+TgmoaGCFPhMjz7veJOeef30=KdpOxgywcLwNbr-Gny-mXwcg@mail.gmail.com

#3Vladlen Popolitov
v.popolitov@postgrespro.ru
In reply to: Dmitry Dolgov (#2)
Re: Changing shared_buffers without restart

Hi

I tried to apply the patches, but failed. I suppose the problem is CRLF line endings in the patch files. At least, after manually changing v1-0001 and v1-0002 from CRLF to LF the patches applied, but that did not help for v1-0003 - v1-0005: they hit other errors during the patch process. Could you check the patch files and post them in the correct format?

The new status of this patch is: Waiting on Author

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Dmitry Dolgov (#1)
Re: Changing shared_buffers without restart

On Sat, Oct 19, 2024 at 8:21 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Currently it
supports only an increase of shared_buffers.

Just BTW in case it is interesting, Palak and I experimented with how
to shrink the buffer pool while PostgreSQL is running, while we were
talking about 13453ee (which it shares infrastructure with). This
version fails if something is pinned and in the way of the shrink
operation, but you could imagine other policies (wait, cancel it,
...):

https://github.com/macdice/postgres/commit/db26fe0c98476cdbbd1bcf553f3b7864cb142247

#5Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thomas Munro (#4)
Re: Changing shared_buffers without restart

On Thu, Nov 07, 2024 at 02:05:52PM GMT, Thomas Munro wrote:
On Sat, Oct 19, 2024 at 8:21 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Currently it
supports only an increase of shared_buffers.

Just BTW in case it is interesting, Palak and I experimented with how
to shrink the buffer pool while PostgreSQL is running, while we were
talking about 13453ee (which it shares infrastructure with). This
version fails if something is pinned and in the way of the shrink
operation, but you could imagine other policies (wait, cancel it,
...):

https://github.com/macdice/postgres/commit/db26fe0c98476cdbbd1bcf553f3b7864cb142247

Thanks, looks interesting. I'll try to experiment with that in the next
version of the patch.

#6Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Vladlen Popolitov (#3)
Re: Changing shared_buffers without restart

On Wed, Nov 06, 2024 at 07:10:06PM GMT, Vladlen Popolitov wrote:
Hi

I tried to apply the patches, but failed. I suppose the problem is CRLF line endings in the patch files. At least, after manually changing v1-0001 and v1-0002 from CRLF to LF the patches applied, but that did not help for v1-0003 - v1-0005: they hit other errors during the patch process. Could you check the patch files and post them in the correct format?

The new status of this patch is: Waiting on Author

Well, I'm going to rebase the patch if that's what you mean. But just
FYI -- it could be applied without any issues to the base commit
mentioned in the series.

#7Peter Eisentraut
peter@eisentraut.org
In reply to: Dmitry Dolgov (#1)
1 attachment(s)
Re: Changing shared_buffers without restart

On 18.10.24 21:21, Dmitry Dolgov wrote:

v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch

Preparation, introduces the possibility to work with many shmem mappings. To
make it less invasive, I've duplicated the shmem API to extend it with the
shmem_slot argument, while redirecting the original API to it. There are
probably better ways of doing that, I'm open for suggestions.

After studying this a bit, I tend to think you should just change the
existing APIs in place. So for example,

void *ShmemAlloc(Size size);

becomes

void *ShmemAlloc(int shmem_slot, Size size);

There aren't that many callers, and all these duplicated interfaces
almost add more new code than they save.

It might be worth making exceptions for interfaces that are likely to be
used by extensions. For example, I see pg_stat_statements using
ShmemInitStruct() and ShmemInitHash(). But that seems to be it. Are
there any other examples out there? Maybe there are many more that I
don't see right now. But at least for the initialization functions, it
doesn't seem worth it to preserve the existing interfaces exactly.

In any case, I think the slot number should be the first argument. This
matches how MemoryContextAlloc() or also talloc() work.
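
For illustration only (none of these slot-taking variants exist; this assumes
the parameter lists otherwise stay as they are), the in-place change to the
shmem.h entry points would look roughly like:

extern void *ShmemAlloc(int shmem_slot, Size size);
extern void *ShmemAllocNoError(int shmem_slot, Size size);
extern void *ShmemInitStruct(int shmem_slot, const char *name, Size size,
							 bool *foundPtr);
extern HTAB *ShmemInitHash(int shmem_slot, const char *name,
						   long init_size, long max_size,
						   HASHCTL *infoP, int hash_flags);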

(Now here is an idea: Could these just be memory contexts? Instead of
making six shared memory slots, could you make six memory contexts with
a special shared memory type. And ShmemAlloc becomes the allocation
function, etc.?)

I noticed the existing code made inconsistent use of PGShmemHeader * vs.
void *, which also bled into your patch. I made the attached little
patch to clean that up a bit.

I suggest splitting the struct ShmemSegment into one struct for the
three memory addresses and a separate array just for the slock_t's. The
former struct can then stay private in storage/ipc/shmem.c, only the
locks need to be exported.

Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS.

Also, maybe some of this should be declared in storage/shmem.h rather
than in storage/pg_shmem.h. We have the existing ShmemLock in there, so
it would be a bit confusing to have the per-segment locks elsewhere.

v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch

Splits shared_buffers into multiple slots, moving out structures that depend on
NBuffers into separate mappings. There are two large gaps here:

* Shmem size calculation for those mappings is not correct yet, it includes too
many other things (no particular issues here, just haven't had time).
* It makes hardcoded assumptions about what is the upper limit for resizing,
which is currently low purely for experiments. Ideally there should be a new
configuration option to specify the total available memory, which would be a
base for subsequent calculations.

Yes, I imagine a shared_buffers_hard_limit setting. We could maybe
default that to the total available memory, but it would also be good to
be able to specify it directly, for testing.

v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch

Allows an anonyous file to back a shared mapping. This makes certain things
easier, e.g. mappings visual representation, and gives an fd for possible
future customizations.

I think this could be a useful patch just by itself, without the rest of
the series, because of

* By default, Linux will not add file-backed shared mappings into a
core dump, making it more convenient to work with them in PostgreSQL:
no more huge dumps to process.

This could be significant operational benefit.

When you say "by default", is this adjustable? Does someone actually
want the whole shared memory in their core file? (If it's adjustable,
is it also adjustable for anonymous mappings?)

I'm wondering about this change:

-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS                  (MAP_SHARED|MAP_HASSEMAPHORE)

It looks like this would affect all mmap() calls, not only the one
you're changing. But that's the only one that uses this macro! I don't
understand why we need this; I don't see anything in the commit log
about this ever being used for any portability. I think we should just
get rid of it and have mmap() use the right flags directly.

I see that FreeBSD has a memfd_create() function. Might be worth a try.
Obviously, this whole thing needs a configure test for memfd_create()
anyway.

I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear
how that interacts with the MAP_HUGETLB flag for mmap(). Do you need to
specify both of them if you want huge pages?

Attachments:

0001-More-thorough-use-of-PGShmemHeader-instead-of-void.patch.nocfbottext/plain; charset=UTF-8; name=0001-More-thorough-use-of-PGShmemHeader-instead-of-void.patch.nocfbotDownload
From 78562cb315da1cc5b35c07aba5a3fd7faacdad48 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 19 Nov 2024 13:15:15 +0100
Subject: [PATCH] More thorough use of PGShmemHeader * instead of void *

---
 src/backend/port/sysv_shmem.c           |  4 ++--
 src/backend/port/win32_shmem.c          |  4 ++--
 src/backend/postmaster/launch_backend.c |  2 +-
 src/backend/storage/ipc/shmem.c         | 13 ++++---------
 src/include/storage/pg_shmem.h          |  2 +-
 src/include/storage/shmem.h             |  3 ++-
 6 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 362a37d3b3a..fa6ee15ce56 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -92,7 +92,7 @@ typedef enum
 
 
 unsigned long UsedShmemSegID = 0;
-void	   *UsedShmemSegAddr = NULL;
+PGShmemHeader *UsedShmemSegAddr = NULL;
 
 static Size AnonymousShmemSize;
 static void *AnonymousShmem = NULL;
@@ -892,7 +892,7 @@ PGSharedMemoryReAttach(void)
 	IpcMemoryId shmid;
 	PGShmemHeader *hdr;
 	IpcMemoryState state;
-	void	   *origUsedShmemSegAddr = UsedShmemSegAddr;
+	PGShmemHeader *origUsedShmemSegAddr = UsedShmemSegAddr;
 
 	Assert(UsedShmemSegAddr != NULL);
 	Assert(IsUnderPostmaster);
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 3bcce9d3b63..827f9cd79b4 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -42,7 +42,7 @@
 void	   *ShmemProtectiveRegion = NULL;
 
 HANDLE		UsedShmemSegID = INVALID_HANDLE_VALUE;
-void	   *UsedShmemSegAddr = NULL;
+PGShmemHeader *UsedShmemSegAddr = NULL;
 static Size UsedShmemSegSize = 0;
 
 static bool EnableLockPagesPrivilege(int elevel);
@@ -424,7 +424,7 @@ void
 PGSharedMemoryReAttach(void)
 {
 	PGShmemHeader *hdr;
-	void	   *origUsedShmemSegAddr = UsedShmemSegAddr;
+	PGShmemHeader *origUsedShmemSegAddr = UsedShmemSegAddr;
 
 	Assert(ShmemProtectiveRegion != NULL);
 	Assert(UsedShmemSegAddr != NULL);
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index 1f2d829ec5a..8f48c938968 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -94,7 +94,7 @@ typedef struct
 	void	   *ShmemProtectiveRegion;
 	HANDLE		UsedShmemSegID;
 #endif
-	void	   *UsedShmemSegAddr;
+	PGShmemHeader *UsedShmemSegAddr;
 	slock_t    *ShmemLock;
 #ifdef USE_INJECTION_POINTS
 	struct InjectionPointsCtl *ActiveInjectionPoints;
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 6d5f0839864..50f987ae240 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -92,18 +92,13 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 
 /*
  *	InitShmemAccess() --- set up basic pointers to shared memory.
- *
- * Note: the argument should be declared "PGShmemHeader *seghdr",
- * but we use void to avoid having to include ipc.h in shmem.h.
  */
 void
-InitShmemAccess(void *seghdr)
+InitShmemAccess(PGShmemHeader *seghdr)
 {
-	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
-
-	ShmemSegHdr = shmhdr;
-	ShmemBase = (void *) shmhdr;
-	ShmemEnd = (char *) ShmemBase + shmhdr->totalsize;
+	ShmemSegHdr = seghdr;
+	ShmemBase = seghdr;
+	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
 }
 
 /*
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 3065ff5be71..7a07c5807ac 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -69,7 +69,7 @@ extern PGDLLIMPORT unsigned long UsedShmemSegID;
 extern PGDLLIMPORT HANDLE UsedShmemSegID;
 extern PGDLLIMPORT void *ShmemProtectiveRegion;
 #endif
-extern PGDLLIMPORT void *UsedShmemSegAddr;
+extern PGDLLIMPORT PGShmemHeader *UsedShmemSegAddr;
 
 #if !defined(WIN32) && !defined(EXEC_BACKEND)
 #define DEFAULT_SHARED_MEMORY_TYPE SHMEM_TYPE_MMAP
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 842989111c3..8cdbe7a89c8 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -27,7 +27,8 @@
 
 /* shmem.c */
 extern PGDLLIMPORT slock_t *ShmemLock;
-extern void InitShmemAccess(void *seghdr);
+struct PGShmemHeader;			/* avoid including storage/pg_shmem.h here */
+extern void InitShmemAccess(struct PGShmemHeader *seghdr);
 extern void InitShmemAllocation(void);
 extern void *ShmemAlloc(Size size);
 extern void *ShmemAllocNoError(Size size);
-- 
2.47.0

#8Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Peter Eisentraut (#7)
Re: Changing shared_buffers without restart

On Tue, Nov 19, 2024 at 01:57:00PM GMT, Peter Eisentraut wrote:
On 18.10.24 21:21, Dmitry Dolgov wrote:

v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch

Preparation, introduces the possibility to work with many shmem mappings. To
make it less invasive, I've duplicated the shmem API to extend it with the
shmem_slot argument, while redirecting the original API to it. There are
probably better ways of doing that, I'm open for suggestions.

After studying this a bit, I tend to think you should just change the
existing APIs in place. So for example,

void *ShmemAlloc(Size size);

becomes

void *ShmemAlloc(int shmem_slot, Size size);

There aren't that many callers, and all these duplicated interfaces almost
add more new code than they save.

It might be worth making exceptions for interfaces that are likely to be
used by extensions. For example, I see pg_stat_statements using
ShmemInitStruct() and ShmemInitHash(). But that seems to be it. Are there
any other examples out there? Maybe there are many more that I don't see
right now. But at least for the initialization functions, it doesn't seem
worth it to preserve the existing interfaces exactly.

In any case, I think the slot number should be the first argument. This
matches how MemoryContextAlloc() or also talloc() work.

Yeah, agree. I'll reshape this part, thanks.

(Now here is an idea: Could these just be memory contexts? Instead of
making six shared memory slots, could you make six memory contexts with a
special shared memory type. And ShmemAlloc becomes the allocation function,
etc.?)

Sounds interesting. I don't know how well the memory context interface
would fit here, but I'll do some investigation.

I noticed the existing code made inconsistent use of PGShmemHeader * vs.
void *, which also bled into your patch. I made the attached little patch
to clean that up a bit.

Right, it was bothering me the whole time, but not strong enough to make
me fix this in the PoC just yet.

I suggest splitting the struct ShmemSegment into one struct for the three
memory addresses and a separate array just for the slock_t's. The former
struct can then stay private in storage/ipc/shmem.c, only the locks need to
be exported.

Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS.

Also, maybe some of this should be declared in storage/shmem.h rather than
in storage/pg_shmem.h. We have the existing ShmemLock in there, so it would
be a bit confusing to have the per-segment locks elsewhere.

[...]

I'm wondering about this change:

-#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS                  (MAP_SHARED|MAP_HASSEMAPHORE)

It looks like this would affect all mmap() calls, not only the one you're
changing. But that's the only one that uses this macro! I don't understand
why we need this; I don't see anything in the commit log about this ever
being used for any portability. I think we should just get rid of it and
have mmap() use the right flags directly.

I see that FreeBSD has a memfd_create() function. Might be worth a try.
Obviously, this whole thing needs a configure test for memfd_create()
anyway.

Yep, those points make sense to me.

v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch

Allows an anonyous file to back a shared mapping. This makes certain things
easier, e.g. mappings visual representation, and gives an fd for possible
future customizations.

I think this could be a useful patch just by itself, without the rest of the
series, because of

* By default, Linux will not add file-backed shared mappings into a
core dump, making it more convenient to work with them in PostgreSQL:
no more huge dumps to process.

This could be significant operational benefit.

When you say "by default", is this adjustable? Does someone actually want
the whole shared memory in their core file? (If it's adjustable, is it also
adjustable for anonymous mappings?)

Yes, there is /proc/<pid>/coredump_filter [1], which allows specifying
what to include. One can ask to exclude anon, file-backed and hugetlb
shared memory, with the only caveat that it's per process. I guess
normally no one wants the full shared memory in the coredump, but
there could be exceptions.
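
For those exceptions, flipping bit 3 (file-backed shared mappings, off in the
default 0x33 mask) back on for the process is enough. A minimal sketch (Linux
only, illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char			buf[32];
	unsigned int	mask = 0;
	ssize_t			n;
	int				fd = open("/proc/self/coredump_filter", O_RDWR);

	if (fd < 0 || (n = read(fd, buf, sizeof(buf) - 1)) <= 0)
		return 1;
	buf[n] = '\0';
	sscanf(buf, "%x", &mask);

	/* bit 3: also dump file-backed shared mappings for this process */
	n = snprintf(buf, sizeof(buf), "0x%x", mask | (1U << 3));
	if (pwrite(fd, buf, n, 0) < 0)
		return 1;
	close(fd);
	return 0;
}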

I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how
that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify
both of them if you want huge pages?

Correct, both (one flag in memfd_create and one for mmap) are needed to
use huge pages.

[1]: https://www.kernel.org/doc/html/latest/filesystems/proc.html#proc-pid-coredump-filter-core-dump-filtering-settings

#9Peter Eisentraut
peter@eisentraut.org
In reply to: Dmitry Dolgov (#8)
Re: Changing shared_buffers without restart

On 19.11.24 14:29, Dmitry Dolgov wrote:

I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how
that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify
both of them if you want huge pages?

Correct, both (one flag in memfd_create and one for mmap) are needed to
use huge pages.

I was worried because the FreeBSD man page says

MFD_HUGETLB This flag is currently unsupported.

It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.

But you should make sure in your patch that the right set of flags for
huge pages is passed.

#10Robert Haas
robertmhaas@gmail.com
In reply to: Dmitry Dolgov (#1)
Re: Changing shared_buffers without restart

On Fri, Oct 18, 2024 at 3:21 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.

A lot of people would like to have this feature, so I hope this
proposal works out. Thanks for working on it.

I think the idea of having multiple shared memory segments is
interesting and makes sense, but I would prefer to see them called
"segments" rather than "slots", just as we do for DSMs. The name
"slot" is somewhat overused, and invites confusion with replication
slots, inter alia. I think it's possible that having multiple fixed
shared memory segments will spell trouble on Windows, where we already
need to use a retry loop to try to get the main shared memory segment
mapped at the correct address. If there are multiple segments and we
need whatever ASLR stuff happens on Windows to not place anything else
overlapping with any of them, that means there's more chances for
stuff to fail than if we just need one address range to be free.
Granted, the individual ranges are smaller, so maybe it's fine? But I
don't know.

The big thing that worries me is synchronization, and while I've only
looked at the patch set briefly, it doesn't look to me as though
there's enough machinery here to make that work correctly. Suppose
that shared_buffers=8GB (a million buffers) and I change it to
shared_buffers=16GB (2 million buffers). As soon as any one backend
has seen that changed and expanded shared_buffers, there's a
possibility that some other backend which has not yet seen the change
might see a buffer number greater than a million. If it tries to use
that buffer number before it absorbs the change, something bad will
happen. The most obvious way for it to see such a buffer number - and
possibly the only one - is to do a lookup in the buffer mapping table
and find a buffer ID there that was inserted by some other backend
that has already seen the change.

Fixing this seems tricky. My understanding is that BufferGetBlock() is
extremely performance-critical, so having to do a bounds check there
to make sure that a given buffer number is in range would probably be
bad for performance. Also, even if the overhead weren't prohibitive, I
don't think we can safely stick code that unmaps and remaps shared
memory segments into a function that currently just does math, because
we've probably got places where we assume this operation can't fail --
as well as places where we assume that if we call BufferGetBlock(i)
and then BufferGetBlock(j), the second call won't change the answer to
the first.

It seems to me that it's probably only safe to swap out a backend's
notion of where shared_buffers is located when the backend holds no
buffer pins, and maybe not even all such places, because it would be a
problem if a backend looks up the address of a buffer before actually
pinning it, on the assumption that the answer can't change. I don't
know if that ever happens, but it would be a legal coding pattern
today. Doing it between statements seems safe as long as there are no
cursors holding pins. Doing it in the middle of a statement is
probably possible if we can verify that we're at a "safe" point in the
code, but I'm not sure exactly which points are safe. If we have no
code anywhere that assumes the address of an unpinned buffer can't
change before we pin it, then I guess the check for pins is the only
thing we need, but I don't know that to be the case.

I guess I would have imagined that a change like this would have to be
done in phases. In phase 1, we'd tell all of the backends that
shared_buffers had expanded to some new, larger value; but the new
buffers wouldn't be usable for anything yet. Then, once we confirmed
that everyone had the memo, we'd tell all the backends that those
buffers are now available for use. If shared_buffers were contracted,
phase 1 would tell all of the backends that shared_buffers had
contracted to some new, smaller value. Once a particular backend
learns about that, they will refuse to put any new pages into those
high-numbered buffers, but the existing contents would still be valid.
Once everyone has been told about this, we can go through and evict
all of those buffers, and then let everyone know that's done. Then
they shrink their mappings.

It looks to me like the patch doesn't expand the buffer mapping table,
which seems essential. But maybe I missed that.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11Peter Eisentraut
peter@eisentraut.org
In reply to: Dmitry Dolgov (#8)
Re: Changing shared_buffers without restart

On 19.11.24 14:29, Dmitry Dolgov wrote:

I noticed the existing code made inconsistent use of PGShmemHeader * vs.
void *, which also bled into your patch. I made the attached little patch
to clean that up a bit.

Right, it was bothering me the whole time, but not strong enough to make
me fix this in the PoC just yet.

I committed a bit of this, so check that when you're rebasing your patch
set.

#12Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Robert Haas (#10)
Re: Changing shared_buffers without restart

On Mon, Nov 25, 2024 at 02:33:48PM GMT, Robert Haas wrote:

I think the idea of having multiple shared memory segments is
interesting and makes sense, but I would prefer to see them called
"segments" rather than "slots", just as we do for DSMs. The name
"slot" is somewhat overused, and invites confusion with replication
slots, inter alia. I think it's possible that having multiple fixed
shared memory segments will spell trouble on Windows, where we already
need to use a retry loop to try to get the main shared memory segment
mapped at the correct address. If there are multiple segments and we
need whatever ASLR stuff happens on Windows to not place anything else
overlapping with any of them, that means there's more chances for
stuff to fail than if we just need one address range to be free.
Granted, the individual ranges are smaller, so maybe it's fine? But I
don't know.

I haven't had a chance to experiment with that on Windows, but I'm
hoping that in the worst case fallback to a single mapping via proposed
infrastructure (and the consequent limitations) would be acceptable.

The big thing that worries me is synchronization, and while I've only
looked at the patch set briefly, it doesn't look to me as though
there's enough machinery here to make that work correctly. Suppose
that shared_buffers=8GB (a million buffers) and I change it to
shared_buffers=16GB (2 million buffers). As soon as any one backend
has seen that changed and expanded shared_buffers, there's a
possibility that some other backend which has not yet seen the change
might see a buffer number greater than a million. If it tries to use
that buffer number before it absorbs the change, something bad will
happen. The most obvious way for it to see such a buffer number - and
possibly the only one - is to do a lookup in the buffer mapping table
and find a buffer ID there that was inserted by some other backend
that has already seen the change.

Right, I haven't put much effort into synchronization yet. It's on my
list for the next iteration of the patch.

code, but I'm not sure exactly which points are safe. If we have no
code anywhere that assumes the address of an unpinned buffer can't
change before we pin it, then I guess the check for pins is the only
thing we need, but I don't know that to be the case.

Probably I'm missing something here. What scenario do you have in mind,
when the address of a buffer is changing?

I guess I would have imagined that a change like this would have to be
done in phases. In phase 1, we'd tell all of the backends that
shared_buffers had expanded to some new, larger value; but the new
buffers wouldn't be usable for anything yet. Then, once we confirmed
that everyone had the memo, we'd tell all the backends that those
buffers are now available for use. If shared_buffers were contracted,
phase 1 would tell all of the backends that shared_buffers had
contracted to some new, smaller value. Once a particular backend
learns about that, they will refuse to put any new pages into those
high-numbered buffers, but the existing contents would still be valid.
Once everyone has been told about this, we can go through and evict
all of those buffers, and then let everyone know that's done. Then
they shrink their mappings.

Yep, sounds good. I was pondering a cruder approach, but doing
this in phases seems to be the way to go.

It looks to me like the patch doesn't expand the buffer mapping table,
which seems essential. But maybe I missed that.

Do you mean the "Shared Buffer Lookup Table"? It does expand it, but
under somewhat unfitting name STRATEGY_SHMEM_SLOT. But now that I look
at the code, I see a few issues around that -- so I would have to
improve it anyway, thanks for pointing that out.

#13Robert Haas
robertmhaas@gmail.com
In reply to: Dmitry Dolgov (#12)
Re: Changing shared_buffers without restart

On Tue, Nov 26, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I haven't had a chance to experiment with that on Windows, but I'm
hoping that in the worst case fallback to a single mapping via proposed
infrastructure (and the consequent limitations) would be acceptable.

Yeah, if you can still fall back to a single mapping, I think that's
OK. It would be nicer if it could work on every platform in the same
way, but half a loaf is better than none.

code, but I'm not sure exactly which points are safe. If we have no
code anywhere that assumes the address of an unpinned buffer can't
change before we pin it, then I guess the check for pins is the only
thing we need, but I don't know that to be the case.

Probably I'm missing something here. What scenario do you have in mind,
when the address of a buffer is changing?

I was assuming that if you expand the mapping for shared_buffers, you
can't count on the new mapping being at the same address as the old
mapping. If you can, that makes things simpler, but what if the OS has
mapped something else just afterward, in the address space that you're
hoping to use when you expand the mapping?

It looks to me like the patch doesn't expand the buffer mapping table,
which seems essential. But maybe I missed that.

Do you mean the "Shared Buffer Lookup Table"? It does expand it, but
under somewhat unfitting name STRATEGY_SHMEM_SLOT. But now that I look
at the code, I see a few issues around that -- so I would have to
improve it anyway, thanks for pointing that out.

Yeah, we -- or at least I -- usually call that the buffer mapping
table. There are identifiers like BufMappingPartitionLock, for
example. I'm slightly surprised that the ShmemInitHash() call uses
something else as the identifier, but I guess that's how it is.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Robert Haas (#13)
Re: Changing shared_buffers without restart

On Wed, Nov 27, 2024 at 10:20:27AM GMT, Robert Haas wrote:

code, but I'm not sure exactly which points are safe. If we have no
code anywhere that assumes the address of an unpinned buffer can't
change before we pin it, then I guess the check for pins is the only
thing we need, but I don't know that to be the case.

Probably I'm missing something here. What scenario do you have in mind,
when the address of a buffer is changing?

I was assuming that if you expand the mapping for shared_buffers, you
can't count on the new mapping being at the same address as the old
mapping. If you can, that makes things simpler, but what if the OS has
mapped something else just afterward, in the address space that you're
hoping to use when you expand the mapping?

Yes, that's the whole point of the exercise with remap -- to keep
addresses unchanged, making buffer management simpler and allowing
mappings to be resized quicker. The trade-off is that we would need to
take care of shared mapping placement.
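
To illustrate the mechanism outside PostgreSQL, here is a minimal standalone
sketch (Linux only, error handling trimmed, sizes arbitrary): an memfd-backed
mapping grown in place with mremap, so existing pointers into it stay valid
as long as nothing else occupies the space right above it:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int
main(void)
{
	size_t	old_size = 4UL << 20;	/* 4 MB */
	size_t	new_size = 16UL << 20;	/* 16 MB */
	int		fd = memfd_create("demo", 0);
	char   *p;

	if (fd < 0 || ftruncate(fd, old_size) < 0)
		return 1;

	p = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	strcpy(p, "survives the resize");

	/* Grow the backing file, then the mapping. Without MREMAP_MAYMOVE the
	 * call fails instead of moving the mapping if the space above is taken. */
	if (ftruncate(fd, new_size) < 0 ||
		mremap(p, old_size, new_size, 0) == MAP_FAILED)
	{
		perror("resize");
		return 1;
	}

	printf("still at %p: %s\n", (void *) p, p);
	return 0;
}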

My understanding is that clashing of mappings (either at creation time
or when resizing) could happen only within the process address space,
and the assumption is that by the time we prepare the mapping layout all
the rest of mappings for the process are already done. But I agree, it's
an interesting question -- I'm going to investigate if those assumptions
could be wrong under certain conditions. Currently if something else is
mapped at the same address where we want to expand the mapping, we will
get an error and can decide how to proceed (e.g. if it happens at
creation time, proceed with a single mapping, otherwise ignore mapping
resize).

#15Robert Haas
robertmhaas@gmail.com
In reply to: Dmitry Dolgov (#14)
Re: Changing shared_buffers without restart

On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

My understanding is that clashing of mappings (either at creation time
or when resizing) could happen only within the process address space,
and the assumption is that by the time we prepare the mapping layout all
the rest of mappings for the process are already done.

I don't think that's correct at all. First, the user could type LOAD
'whatever' at any time. But second, even if they don't or you prohibit
them from doing so, the process could allocate memory for any of a
million different things, and that could require mapping a new region
of memory, and the OS could choose to place that just after an
existing mapping, or at least close enough that we can't expand the
object size as much as desired.

If we had an upper bound on the size of shared_buffers and could
reserve that amount of address space at startup time but only actually
map a portion of it, then we could later remap and expand into the
reserved space. Without that, I think there's absolutely no guarantee
that the amount of address space that we need is available when we
want to extend a mapping.
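
For illustration, a standalone sketch of that idea (Linux only; the sizes and
the memfd name are arbitrary): reserve the upper bound as an inaccessible
mapping once, then map the portion actually in use over it at a fixed
address, so growing never changes the start address:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int
main(void)
{
	size_t	reserved = 1UL << 30;	/* upper bound, e.g. 1 GB */
	size_t	in_use = 128UL << 20;	/* currently mapped portion */
	char   *base;
	int		fd;

	/* Reserve address space only; nothing is committed yet. */
	base = mmap(NULL, reserved, PROT_NONE,
				MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (base == MAP_FAILED)
		return 1;

	fd = memfd_create("buffers", 0);
	if (fd < 0 || ftruncate(fd, in_use) < 0)
		return 1;

	/* Commit the used portion inside the reservation; MAP_FIXED is safe
	 * here because we own the whole reserved range. */
	if (mmap(base, in_use, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
		return 1;
	strcpy(base, "kept across the grow");

	/* Growing: extend the file and map a larger slice at the same place,
	 * still within the reservation, so base never changes. */
	in_use = 256UL << 20;
	if (ftruncate(fd, in_use) < 0 ||
		mmap(base, in_use, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
		return 1;

	printf("base %p still says: %s\n", (void *) base, base);
	return 0;
}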

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Robert Haas (#15)
Re: Changing shared_buffers without restart

On Wed, 27 Nov 2024 at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:

If we had an upper bound on the size of shared_buffers

I think a fairly reliable upper bound is the amount of physical memory
on the system at time of postmaster start. We could make it a GUC to
set the upper bound for the rare cases where people do stuff like
adding swap space later or doing online VM growth. We could even have
the default be something like 4x the physical memory to accommodate
those people by default.
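
For illustration (just a sketch; these are the usual Linux/glibc sysconf
names), the starting point for such a default could be:

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	long	pages = sysconf(_SC_PHYS_PAGES);
	long	page_size = sysconf(_SC_PAGESIZE);

	if (pages < 0 || page_size < 0)
		return 1;

	unsigned long long phys = (unsigned long long) pages * page_size;

	/* e.g. default the hard upper bound to 4x physical memory */
	printf("physical: %llu MB, default upper bound: %llu MB\n",
		   phys / (1024 * 1024), (4 * phys) / (1024 * 1024));
	return 0;
}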

reserve that amount of address space at startup time but only actually
map a portion of it

Or is this the difficult part?

#17Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#15)
Re: Changing shared_buffers without restart

Hi,

On 2024-11-27 16:05:47 -0500, Robert Haas wrote:

On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

My understanding is that clashing of mappings (either at creation time
or when resizing) could happen only within the process address space,
and the assumption is that by the time we prepare the mapping layout all
the rest of mappings for the process are already done.

I don't think that's correct at all. First, the user could type LOAD
'whatever' at any time. But second, even if they don't or you prohibit
them from doing so, the process could allocate memory for any of a
million different things, and that could require mapping a new region
of memory, and the OS could choose to place that just after an
existing mapping, or at least close enough that we can't expand the
object size as much as desired.

If we had an upper bound on the size of shared_buffers and could
reserve that amount of address space at startup time but only actually
map a portion of it, then we could later remap and expand into the
reserved space. Without that, I think there's absolutely no guarantee
that the amount of address space that we need is available when we
want to extend a mapping.

Strictly speaking we don't actually need to map shared buffers to the same
location in each process... We do need that for most other uses of shared
memory, including the buffer mapping table, but not for the buffer data
itself.

Whether it's worth the complexity of dealing with differing locations is
another matter.

Greetings,

Andres Freund

#18Robert Haas
robertmhaas@gmail.com
In reply to: Jelte Fennema-Nio (#16)
Re: Changing shared_buffers without restart

On Wed, Nov 27, 2024 at 4:28 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

On Wed, 27 Nov 2024 at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:

If we had an upper bound on the size of shared_buffers

I think a fairly reliable upper bound is the amount of physical memory
on the system at time of postmaster start. We could make it a GUC to
set the upper bound for the rare cases where people do stuff like
adding swap space later or doing online VM growth. We could even have
the default be something like 4x the physical memory to accommodate
those people by default.

Yes, Peter mentioned similar ideas on this thread last week.

reserve that amount of address space at startup time but only actually
map a portion of it

Or is this the difficult part?

I'm not sure how difficult this is, although I'm pretty sure that it's
more difficult than adding a GUC. My point wasn't so much whether this
is easy or hard but rather that it's essential if you want to avoid
having addresses change when the resizing happens.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#17)
Re: Changing shared_buffers without restart

On Wed, Nov 27, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:

Strictly speaking we don't actually need to map shared buffers to the same
location in each process... We do need that for most other uses of shared
memory, including the buffer mapping table, but not for the buffer data
itself.

Well, if it can move, then you have to make sure it doesn't move while
someone's holding onto a pointer into it. I'm not exactly sure how
hard it is to guarantee that, but we certainly do construct pointers
into shared_buffers and use them at least for short periods of time,
so it's not a purely academic concern.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Robert Haas (#15)
Re: Changing shared_buffers without restart

On Wed, Nov 27, 2024 at 04:05:47PM GMT, Robert Haas wrote:
On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

My understanding is that clashing of mappings (either at creation time
or when resizing) could happen only within the process address space,
and the assumption is that by the time we prepare the mapping layout all
the rest of mappings for the process are already done.

I don't think that's correct at all. First, the user could type LOAD
'whatever' at any time. But second, even if they don't or you prohibit
them from doing so, the process could allocate memory for any of a
million different things, and that could require mapping a new region
of memory, and the OS could choose to place that just after an
existing mapping, or at least close enough that we can't expand the
object size as much as desired.

If we had an upper bound on the size of shared_buffers and could
reserve that amount of address space at startup time but only actually
map a portion of it, then we could later remap and expand into the
reserved space. Without that, I think there's absolutely no guarantee
that the amount of address space that we need is available when we
want to extend a mapping.

I've just done a couple of experiments, and I think this could be addressed by
careful placing of mappings as well, based on two assumptions: for a new
mapping the kernel always picks the lowest address that allows enough space,
and the maximum amount of allocable memory for other mappings could be derived
from the total available memory. With that in mind the shared mapping layout
will have to have a large gap at the start, between the lowest address and the
shared mappings used for buffers and the rest -- the gap where all the other
mappings (allocations, libraries, madvise, etc.) will land. It's similar to the
address space reserving you mentioned above, it will reduce the possibility of
clashing significantly, and it looks something like this:

01339000-0139e000 [heap]
0139e000-014aa000 [heap]
7f2dd72f6000-7f2dfbc9c000 /memfd:strategy (deleted)
7f2e0209c000-7f2e269b0000 /memfd:checkpoint (deleted)
7f2e2cdb0000-7f2e516b4000 /memfd:iocv (deleted)
7f2e57ab4000-7f2e7c478000 /memfd:descriptors (deleted)
7f2ebc478000-7f2ee8d3c000 /memfd:buffers (deleted)
^ note the distance between two mappings,
which is intended for resize
7f3168d3c000-7f318d600000 /memfd:main (deleted)
^ here is where the gap starts
7f4194c00000-7f4194e7d000
^ this one is an anonymous mapping created due to a large
memory allocation after the shared mappings were created
7f4195000000-7f419527d000
7f41952dc000-7f4195416000
7f4195416000-7f4195600000 /dev/shm/PostgreSQL.2529797530
7f4195600000-7f41a311d000 /usr/lib/locale/locale-archive
7f41a317f000-7f41a3200000
7f41a3200000-7f41a3201000 /usr/lib64/libicudata.so.74.2

The assumption about picking the lowest address is just how it works right now
on Linux; this fact is already used in the patch. The idea that we could put an
upper boundary on the size of other mappings based on total available memory
comes from the fact that anonymous mappings that are much larger than memory
will fail without overcommit. With overcommit it becomes different, but if
allocations are hitting that limit I can imagine there are bigger problems than
shared buffer resize.

This approach follows the same ideas already used in the patch, and has the
same trade-offs: no address changes, but open questions about portability.
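
For illustration, here is a minimal, hypothetical sketch of that placement
trick (sizes are invented; MAP_FIXED_NOREPLACE is Linux-specific, kernel
4.17+). It is not the patch code, just the idea:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#define PROBE_SIZE  (128UL * 1024 * 1024)        /* default shared_buffers */
#define EXTRA_ROOM  (10UL * 1024 * 1024 * 1024)  /* room reserved for growth */

int main(void)
{
    size_t  alloc = 256UL * 1024 * 1024;
    void   *probe, *addr, *shmem;

    /* Let the kernel pick the lowest address that fits the probe mapping. */
    probe = mmap(NULL, PROBE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (probe == MAP_FAILED)
        return 1;
    munmap(probe, PROBE_SIZE);

    /*
     * Place the real mapping below the probe address, leaving EXTRA_ROOM of
     * free address space above its end. MAP_FIXED_NOREPLACE fails with
     * EEXIST instead of clobbering an existing mapping.
     */
    addr = (char *) probe - EXTRA_ROOM - alloc;
    shmem = mmap(addr, alloc, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (shmem == MAP_FAILED)
        perror("placed mmap (would fall back to mmap(NULL, ...))");

    return 0;
}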

#21Robert Haas
robertmhaas@gmail.com
In reply to: Dmitry Dolgov (#20)
Re: Changing shared_buffers without restart

On Thu, Nov 28, 2024 at 11:30 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

on Linux; this fact is already used in the patch. The idea that we could put an
upper boundary on the size of other mappings based on total available memory
comes from the fact that anonymous mappings that are much larger than memory
will fail without overcommit. With overcommit it becomes different, but if
allocations are hitting that limit I can imagine there are bigger problems than
shared buffer resize.

This approach follows the same ideas already used in the patch, and has the
same trade-offs: no address changes, but open questions about portability.

I definitely welcome the fact that you have some platform-specific
knowledge of the Linux behavior, because that's expertise that is
obviously quite useful here and which I lack. I'm personally not
overly concerned about whether it works on every other platform -- I
would prefer an implementation that works everywhere, but I'd rather
have one that works on Linux than have nothing. It's unclear to me why
operating systems don't offer better primitives for this sort of thing
-- in theory there could be a system call that sets aside a pool of
address space and then other system calls that let you allocate
shared/unshared memory within that space or even at specific
addresses, but actually such things don't exist.

All that having been said, what does concern me a bit is our ability
to predict what Linux will do well enough to keep what we're doing
safe; and also whether the Linux behavior might abruptly change in the
future. Users would be sad if we released this feature and then a
future kernel upgrade causes PostgreSQL to completely stop working. I
don't know how the Linux kernel developers actually feel about this
sort of thing, but if I imagine myself as a kernel developer, I can
totally see myself saying "well, we never promised that this would
work in any particular way, so we're free to change it whenever we
like." We've certainly used that argument here countless times.

--
Robert Haas
EDB: http://www.enterprisedb.com

#22Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Robert Haas (#21)
Re: Changing shared_buffers without restart

On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:

[...] It's unclear to me why
operating systems don't offer better primitives for this sort of thing
-- in theory there could be a system call that sets aside a pool of
address space and then other system calls that let you allocate
shared/unshared memory within that space or even at specific
addresses, but actually such things don't exist.

Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall
allows you to request memory from the OS at arbitrary addresses - it's
just that stdlib's malloc doesn't expose the 'alloc at this address'
part of that API.

Windows seems to have an equivalent API in VirtualAlloc*. Both the
Windows API and Linux's mmap have an optional address argument, which
(when not NULL) is where the allocation will be placed (some
conditions apply, based on flags and specific API used), so, assuming
we have some control over where to allocate memory, we should be able to
reserve enough memory by using these APIs.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#23Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Robert Haas (#21)
Re: Changing shared_buffers without restart

On Thu, Nov 28, 2024 at 12:18:54PM GMT, Robert Haas wrote:

All that having been said, what does concern me a bit is our ability
to predict what Linux will do well enough to keep what we're doing
safe; and also whether the Linux behavior might abruptly change in the
future. Users would be sad if we released this feature and then a
future kernel upgrade causes PostgreSQL to completely stop working. I
don't know how the Linux kernel developers actually feel about this
sort of thing, but if I imagine myself as a kernel developer, I can
totally see myself saying "well, we never promised that this would
work in any particular way, so we're free to change it whenever we
like." We've certainly used that argument here countless times.

Agreed, at the moment I can't say for sure how reliable this behavior is
in the long term. I'll try to see if there are ways to get more confidence
about that.

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matthias van de Meent (#22)
Re: Changing shared_buffers without restart

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:

On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:

[...] It's unclear to me why
operating systems don't offer better primitives for this sort of thing
-- in theory there could be a system call that sets aside a pool of
address space and then other system calls that let you allocate
shared/unshared memory within that space or even at specific
addresses, but actually such things don't exist.

Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall
allows you to request memory from the OS at arbitrary addresses - it's
just that stdlib's malloc doesn't expose the 'alloc at this address'
part of that API.

I think what Robert is concerned about is that there is exactly 0
guarantee that that will succeed, because you have no control over
system-driven allocations of address space (for example, loading
of extensions or JIT code). In fact, given things like ASLR, there
is pressure on the kernel crew to make that *less* predictable not
more so. So even if we devise a method that seems to work reliably
today, we could have little faith that it would work with next year's
kernels.

regards, tom lane

#25Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tom Lane (#24)
Re: Changing shared_buffers without restart

On Thu, 28 Nov 2024 at 19:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:

On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:

[...] It's unclear to me why
operating systems don't offer better primitives for this sort of thing
-- in theory there could be a system call that sets aside a pool of
address space and then other system calls that let you allocate
shared/unshared memory within that space or even at specific
addresses, but actually such things don't exist.

Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall
allows you to request memory from the OS at arbitrary addresses - it's
just that stdlib's malloc doesn't expose the 'alloc at this address'
part of that API.

I think what Robert is concerned about is that there is exactly 0
guarantee that that will succeed, because you have no control over
system-driven allocations of address space (for example, loading
of extensions or JIT code). In fact, given things like ASLR, there
is pressure on the kernel crew to make that *less* predictable not
more so.

I see what you mean, but I think that shouldn't be much of an issue.
I'm not a kernel hacker, but I've never heard of anyone arguing to
remove mmap's mapping-overwriting behavior for user-controlled
mappings - it seems too useful as a way to guarantee relative memory
addresses (agreed, there is now mseal(2), but that is the user asking
for security on their own mapping, this isn't applied to arbitrary
mappings).

I mean, we can do the following to get a nice contiguous empty address
space no other mmap(NULL)s will get put into:

/* reserve size bytes of memory */
base = mmap(NULL, size, PROT_NONE, ...flags, ...);
/* use the first small_size bytes of that reservation */
allocated_in_reserved = mmap(base, small_size, PROT_READ |
PROT_WRITE, MAP_FIXED, ...);

With the PROT_NONE protection option the OS doesn't actually allocate
any backing memory, but guarantees no other mmap(NULL, ...) will get
placed in that area such that it overlaps with that allocation until
the area is munmap-ed, thus allowing us to reserve a chunk of address
space without actually using (much) memory. Deallocations have to go
through mmap(... PROT_NONE, ...) instead of munmap if we'd want to
keep the full area reserved, but I think that's not that much of an
issue.

I also highly doubt Linux will remove or otherwise limit the PROT_NONE
option to such a degree that we won't be able to "balloon" the memory
address space for (e.g.) dynamic shared buffer resizing.

See also: FreeBSD's MAP_GUARD mmap flag, Windows' MEM_RESERVE and
MEM_RESERVE_PLACEHOLDER flags for VirtualAlloc[2][Ex].
See also [0], where PROT_NONE is explicitly called out as a tool for
reserving memory address space.

So even if we devise a method that seems to work reliably
today, we could have little faith that it would work with next year's
kernels.

I really don't think that userspace memory address space reservations
through e.g. PROT_NONE or MEM_RESERVE[_PLACEHOLDER] will be retired
anytime soon, at least not without the relevant kernels also providing
effective alternatives.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: https://www.gnu.org/software/libc/manual/html_node/Memory-Protection.html

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matthias van de Meent (#25)
Re: Changing shared_buffers without restart

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:

I mean, we can do the following to get a nice contiguous empty address
space no other mmap(NULL)s will get put into:

/* reserve size bytes of memory */
base = mmap(NULL, size, PROT_NONE, ...flags, ...);
/* use the first small_size bytes of that reservation */
allocated_in_reserved = mmap(base, small_size, PROT_READ |
PROT_WRITE, MAP_FIXED, ...);

With the PROT_NONE protection option the OS doesn't actually allocate
any backing memory, but guarantees no other mmap(NULL, ...) will get
placed in that area such that it overlaps with that allocation until
the area is munmap-ed, thus allowing us to reserve a chunk of address
space without actually using (much) memory.

Well, that's all great if it works portably. But I don't see one word
in either POSIX or the Linux mmap(2) man page that promises those
semantics for PROT_NONE. I also wonder how well a giant chunk of
"unbacked" address space will interoperate with the OOM killer,
top(1)'s display of used memory, and other things that have caused us
headaches with large shared-memory arenas.

Maybe those issues are all in the past and this'll work great.
I'm not holding my breath though.

regards, tom lane

#27Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Matthias van de Meent (#25)
Re: Changing shared_buffers without restart

On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote:

I mean, we can do the following to get a nice contiguous empty address
space no other mmap(NULL)s will get put into:

/* reserve size bytes of memory */
base = mmap(NULL, size, PROT_NONE, ...flags, ...);
/* use the first small_size bytes of that reservation */
allocated_in_reserved = mmap(base, small_size, PROT_READ |
PROT_WRITE, MAP_FIXED, ...);

With the PROT_NONE protection option the OS doesn't actually allocate
any backing memory, but guarantees no other mmap(NULL, ...) will get
placed in that area such that it overlaps with that allocation until
the area is munmap-ed, thus allowing us to reserve a chunk of address
space without actually using (much) memory.

From what I understand it's not much different from the scenario where we
just map as much as we want in advance. The actual memory will not be
allocated in either case due to CoW, and oom_score seems to be the same. I
agree it sounds attractive, but after some experimenting it looks like
it won't work with huge pages inside a cgroup v2 (i.e. a container).

The reason is that Linux has recently learned to apply memory reservation
limits on hugetlb inside a cgroup, and those limits are applied at mmap
time. Nowadays this feature is often configured out of the box in various
container orchestrators, meaning that a scenario like "set hugetlb=1GB on a
container, reserve 32GB with PROT_NONE" will fail. I've also tried to mix
and match: reserve some address space via a non-hugetlb mapping and allocate
a hugetlb mapping out of it, but that doesn't work either (the smaller mmap
complains about MAP_HUGETLB with EINVAL).

#28Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#20)
Re: Changing shared_buffers without restart

Hi,

On 2024-11-28 17:30:32 +0100, Dmitry Dolgov wrote:

The assumption about picking the lowest address is just how it works right now
on Linux; this fact is already used in the patch. The idea that we could put an
upper boundary on the size of other mappings based on total available memory
comes from the fact that anonymous mappings that are much larger than memory
will fail without overcommit.

The overcommit issue shouldn't be a big hurdle - by mmap()ing with
MAP_NORESERVE the space isn't reserved. Then madvise with MADV_POPULATE_WRITE
can be used to actually populate the used range of the mapping and MADV_REMOVE
can be used to shrink the mapping again.
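
For reference, a rough, hypothetical sketch of that pattern (assuming a
recent Linux kernel and glibc; MADV_POPULATE_WRITE appeared in 5.14):

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t max_size  = (size_t) 8 * 1024 * 1024 * 1024;  /* reserved range */
    size_t used_size = (size_t) 1 * 1024 * 1024 * 1024;  /* actually in use */

    /* Reserve the full range without charging it against overcommit. */
    char *base = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return 1;

    /* Populate (prefault) only the part that is really used. */
    if (madvise(base, used_size, MADV_POPULATE_WRITE) < 0)
        perror("madvise(MADV_POPULATE_WRITE)");

    /* Shrink again: release the pages and backing store, keep the range. */
    if (madvise(base, used_size, MADV_REMOVE) < 0)
        perror("madvise(MADV_REMOVE)");

    munmap(base, max_size);
    return 0;
}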

With overcommit it becomes different, but if allocations are hitting that
limit I can imagine there are bigger problems than shared buffer resize.

I'm fairly sure it'll not work to just disregard issues around overcommit. An
overly large memory allocation, without MAP_NORESERVE, will actually reduce
the amount of memory that can be used for other allocations. That's obviously
problematic, because you'll now have a smaller shared buffers, but can't use
the memory for work_mem type allocations...

Greetings,

Andres Freund

#29Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#27)
Re: Changing shared_buffers without restart

On Fri, Nov 29, 2024 at 05:47:27PM GMT, Dmitry Dolgov wrote:

On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote:

I mean, we can do the following to get a nice contiguous empty address
space no other mmap(NULL)s will get put into:

/* reserve size bytes of memory */
base = mmap(NULL, size, PROT_NONE, ...flags, ...);
/* use the first small_size bytes of that reservation */
allocated_in_reserved = mmap(base, small_size, PROT_READ |
PROT_WRITE, MAP_FIXED, ...);

With the PROT_NONE protection option the OS doesn't actually allocate
any backing memory, but guarantees no other mmap(NULL, ...) will get
placed in that area such that it overlaps with that allocation until
the area is munmap-ed, thus allowing us to reserve a chunk of address
space without actually using (much) memory.

From what I understand it's not much different from the scenario where we
just map as much as we want in advance. The actual memory will not be
allocated in either case due to CoW, and oom_score seems to be the same. I
agree it sounds attractive, but after some experimenting it looks like
it won't work with huge pages inside a cgroup v2 (i.e. a container).

The reason is that Linux has recently learned to apply memory reservation
limits on hugetlb inside a cgroup, and those limits are applied at mmap
time. Nowadays this feature is often configured out of the box in various
container orchestrators, meaning that a scenario like "set hugetlb=1GB on a
container, reserve 32GB with PROT_NONE" will fail. I've also tried to mix
and match: reserve some address space via a non-hugetlb mapping and allocate
a hugetlb mapping out of it, but that doesn't work either (the smaller mmap
complains about MAP_HUGETLB with EINVAL).

I've asked about that in linux-mm [1]. To my surprise, the
recommendations were to stick to creating a large mapping in advance,
and slice smaller mappings out of that, which could be resized later.
The OOM score should not be affected, and hugetlb could be avoided using
MAP_NORESERVE flag for the initial mapping (I've experimented with that,
seems to be working just fine, even if the slices are not using
MAP_NORESERVE).
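
A hypothetical sketch of that layout, combined with the memfd-backed
segments used elsewhere in the patch set (sizes invented, error handling
omitted, Linux-specific), could look like this:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

#define RESERVED  (16UL * 1024 * 1024 * 1024)  /* upper bound for the segment */
#define INITIAL   (2UL * 1024 * 1024 * 1024)   /* current shared_buffers */
#define GROWN     (4UL * 1024 * 1024 * 1024)   /* shared_buffers after resize */

int main(void)
{
    /* Reserve the whole address range; PROT_NONE keeps other mmap(NULL)
     * calls away and no memory is backing the range yet. */
    char *base = mmap(NULL, RESERVED, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    /* Back the active slice of the range with an anonymous file. */
    int fd = memfd_create("buffers", 0);
    ftruncate(fd, INITIAL);
    mmap(base, INITIAL, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);

    /* Growing: extend the file and map the additional window in place.
     * The already-mapped part, and thus all buffer addresses, never move. */
    ftruncate(fd, GROWN);
    mmap(base + INITIAL, GROWN - INITIAL, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, INITIAL);

    return 0;
}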

I guess that would mean I'll try to experiment with this approach as
well. But what do others think? How much research do we need to do to gain
some confidence about large shared mappings and make it realistically
acceptable?

[1]: https://lore.kernel.org/linux-mm/pr7zggtdgjqjwyrfqzusih2suofszxvlfxdptbo2smneixkp7i@nrmtbhemy3is/t/

#30Robert Haas
robertmhaas@gmail.com
In reply to: Dmitry Dolgov (#29)
Re: Changing shared_buffers without restart

On Mon, Dec 2, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I've asked about that in linux-mm [1]. To my surprise, the
recommendations were to stick to creating a large mapping in advance,
and slice smaller mappings out of that, which could be resized later.
The OOM score should not be affected, and hugetlb could be avoided using
MAP_NORESERVE flag for the initial mapping (I've experimented with that,
seems to be working just fine, even if the slices are not using
MAP_NORESERVE).

I guess that would mean I'll try to experiment with this approach as
well. But what do others think? How much research do we need to do to gain
some confidence about large shared mappings and make it realistically
acceptable?

Personally, I like this approach. It seems to me that this opens up
the possibility of a system where the virtual addresses of data
structures in shared memory never change, which I think will avoid an
absolutely massive amount of implementation complexity. It's obviously
not ideal that we have to specify in advance an upper limit on the
potential size of shared_buffers, but we can live with it. It's better
than what we have today; and certainly cloud providers will have no
issue with pre-setting that to a reasonable value. I don't know if we
can port it to other operating systems, but it seems at least possible
that they offer similar primitives, or will in the future; if not, we
can disable the feature on those platforms.

I still think the synchronization is going to be tricky. For example
when you go to shrink a mapping, you need to make sure that it's free
of buffers that anyone might touch; and when you grow a mapping, you
need to make sure that nobody tries to touch that address space before
they grow the mapping, which goes back to my earlier point about
someone doing a lookup into the buffer mapping table and finding a
buffer number that is beyond the end of what they've already mapped.
But I think it may be doable with sufficient cleverness.

--
Robert Haas
EDB: http://www.enterprisedb.com

#31Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Robert Haas (#30)
1 attachment(s)
Re: Changing shared_buffers without restart

On Tue, Dec 3, 2024 at 8:01 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 2, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I've asked about that in linux-mm [1]. To my surprise, the
recommendations were to stick to creating a large mapping in advance,
and slice smaller mappings out of that, which could be resized later.
The OOM score should not be affected, and hugetlb could be avoided using
MAP_NORESERVE flag for the initial mapping (I've experimented with that,
seems to be working just fine, even if the slices are not using
MAP_NORESERVE).

I guess that would mean I'll try to experiment with this approach as
well. But what do others think? How much research do we need to do to gain
some confidence about large shared mappings and make it realistically
acceptable?

Personally, I like this approach. It seems to me that this opens up
the possibility of a system where the virtual addresses of data
structures in shared memory never change, which I think will avoid an
absolutely massive amount of implementation complexity. It's obviously
not ideal that we have to specify in advance an upper limit on the
potential size of shared_buffers, but we can live with it. It's better
than what we have today; and certainly cloud providers will have no
issue with pre-setting that to a reasonable value. I don't know if we
can port it to other operating systems, but it seems at least possible
that they offer similar primitives, or will in the future; if not, we
can disable the feature on those platforms.

I still think the synchronization is going to be tricky. For example
when you go to shrink a mapping, you need to make sure that it's free
of buffers that anyone might touch; and when you grow a mapping, you
need to make sure that nobody tries to touch that address space before
they grow the mapping, which goes back to my earlier point about
someone doing a lookup into the buffer mapping table and finding a
buffer number that is beyond the end of what they've already mapped.
But I think it may be doable with sufficient cleverness.

From the discussion so far, the protocol for each shared memory slot
(or segment, as suggested by Robert) seems to be the following:
1. At the start, create a memory mapping using mmap with the maximum
allocation (maxsize) with PROT_READ/PROT_WRITE and MAP_NORESERVE to
reserve address space. Assume this is created at virtual address
maddr.
2. Resize it to the required size (size) using mremap() - this will be
used to create shared memory objects.
3. Map a segment with PROT_NONE and MAP_NORESERVE at maddr + size.
This segment would not allow any other mapping to be added in the
required space. PROT_NONE will protect from unintentional writes/reads
to or from this space.
4. When resizing the segment, remove the mapping created in step 3 and
execute steps 2 and 3 again. Synchronization, mentioned by Robert,
should be carried out somewhere in this step.
Note that the addresses need to be aligned as per the mmap and mremap requirements.

Please correct me if I am wrong.
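
To make that concrete, here is a rough, hypothetical sketch of steps 1-4
(Linux-specific; sizes and error handling are simplified):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#define MAX_SIZE  (3UL * 1024 * 1024)  /* upper limit of the segment */
#define INIT_SIZE (1UL * 1024 * 1024)  /* currently required size */

int main(void)
{
    size_t new_size = 2UL * 1024 * 1024;

    /* 1. Reserve the whole address range up front. */
    char *maddr = mmap(NULL, MAX_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (maddr == MAP_FAILED)
        return 1;

    /* 2. Shrink the active mapping to the currently required size. */
    if (mremap(maddr, MAX_SIZE, INIT_SIZE, 0) == MAP_FAILED)
        perror("mremap shrink");

    /* 3. Cover the tail with a PROT_NONE guard so nothing else lands there. */
    mmap(maddr + INIT_SIZE, MAX_SIZE - INIT_SIZE, PROT_NONE,
         MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED, -1, 0);

    /* 4. Resize: drop the guard, grow the active mapping, re-create the guard. */
    munmap(maddr + INIT_SIZE, MAX_SIZE - INIT_SIZE);
    if (mremap(maddr, INIT_SIZE, new_size, 0) == MAP_FAILED)
        perror("mremap grow");
    mmap(maddr + new_size, MAX_SIZE - new_size, PROT_NONE,
         MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED, -1, 0);

    return 0;
}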

I wrote the attached simple program simulating this protocol. It seems
to work as expected. However, mmap'ing with MAP_FIXED would still be
able to dislodge the reserved memory. But that's true for any mapped
segment, not just for reserved memory.

A bit about the program: It reserves a 3MB memory segment and resizes
it to 1MB, 2MB and back to 3MB, thus exercising both shrinking and
enlarging the memory. It forks a child process after resizing the
memory segment the first time. At every step it makes sure that the parent
and child programs can write and read at the boundaries of the resized
memory segment. The program waits for getchar() at these steps. So in
case the program seems to be stuck, try pressing Enter once or twice.

I could verify the memory mappings, their sizes etc. by looking at
/proc/PID/maps and /proc/PID/status but I did not find a way to verify
the amount of memory actually allocated and verify that it's actually
shrinking and expanding. Please let me know how to verify that.

--
Best Wishes,
Ashutosh Bapat

Attachments:

mmap_exp.c (text/x-csrc)
#32Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Ashutosh Bapat (#31)
6 attachment(s)
Re: Changing shared_buffers without restart

Hi Dmitry,

On Tue, Dec 17, 2024 at 7:40 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

I could verify the memory mappings, their sizes etc. by looking at
/proc/PID/maps and /proc/PID/status but I did not find a way to verify
the amount of memory actually allocated and verify that it's actually
shrinking and expanding. Please let me know how to verify that.

As mentioned somewhere upthread, mmap or mremap by themselves do
not allocate any memory. Writing to the mapped region causes memory to
be allocated, which shows up in VmRSS and RssShmem. And the accounted
memory does shrink again if mremap() shrinks the mapped region.
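
For example, a small hypothetical helper along these lines can be used to
watch those counters while exercising the mapping (Linux-specific, since it
parses /proc/self/status):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void print_rss(const char *label)
{
    char  line[128];
    FILE *f = fopen("/proc/self/status", "r");

    if (f == NULL)
        return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0 || strncmp(line, "RssShmem:", 9) == 0)
            printf("%s %s", label, line);
    fclose(f);
}

int main(void)
{
    size_t size = 64UL * 1024 * 1024;
    char  *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    print_rss("after mmap  ");   /* nothing touched yet, RSS stays small */
    memset(p, 1, size);
    print_rss("after write ");   /* VmRSS/RssShmem grow by ~64 MB */
    mremap(p, size, size / 2, 0);
    print_rss("after shrink");   /* the unmapped half is accounted away */
    return 0;
}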

Attached are patches rebased on top of commit
2a7b2d97171dd39dca7cefb91008a3c84ec003ba. I have also fixed
compilation errors. Otherwise I haven't changed anything in the
patches. The last patch adds some TODOs and questions which I think
we need to address while completing this work, just added as a
reminder for later. The TODO in postgres.c is related to your observation

Another rough edge is that a
backend, executing pg_reload_conf interactively, will not resize
mappings immediately; for some reason it will require another command.

I don't have a solution right now, but at least the comment documents
the reason and points to its origin.

I am next looking at the problem of synchronizing the change across
the backends.

--
Best Wishes,
Ashutosh Bapat

Attachments:

0002-Allow-placing-shared-memory-mapping-with-an-20250113.patch (text/x-patch)
From b5d295fc0a58b47228add95b4b8ad00fc0a66c07 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:21:33 +0200
Subject: [PATCH 2/7] Allow placing shared memory mapping with an offset

Currently the kernel is responsible for choosing an address at which to place
each shared memory mapping, which is the lowest possible address that does not
clash with any other mappings. This is considered to be the most portable
approach, but one of the downsides is that there is no room to resize allocated
mappings anymore. Here is how it looks for one mapping in /proc/$PID/maps,
where /dev/zero represents the anonymous shared memory we are talking about:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
    ...
    7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
    7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)

By specifying the mapping address directly it's possible to place the
mapping in a way that leaves room for resizing. The idea is first to get
the address chosen by the kernel, then apply some offset derived from
the expected upper limit. Because we base the layout on the address
chosen by the kernel, things like address space randomization should not
be a problem, since the randomization is applied to the mmap base, which
is one per process. The result looks like this:

    012d9000-0133e000         [heap]
    7f443a800000-7f444196c000 /dev/zero (deleted)
    [...free space...]
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2

This approach does not impact the actual memory usage as reported by the kernel.
Here is the output of /proc/$PID/status for the master version with
shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages mapped in mm_struct
    VmPeak:           422780 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             21248 kB
    // Size of resident anonymous memory
    RssAnon:             640 kB
    // Size of resident file mappings
    RssFile:            9728 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          10880 kB

Here is the same for the patch with the shared mapping placed at
an offset 10 GB:

    VmPeak:          1102844 kB
    VmRSS:             21376 kB
    RssAnon:             640 kB
    RssFile:            9856 kB
    RssShmem:          10880 kB

Cgroup v2 doesn't have any problems with that either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    18219008 (~17 MB)

Note that currently the implementation makes assumptions about the upper limit.
Ideally it should be based on the maximum available memory.
---
 src/backend/port/sysv_shmem.c | 122 +++++++++++++++++++++++++++++++++-
 1 file changed, 121 insertions(+), 1 deletion(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 475c9c8f1a1..bae8f19a755 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -108,6 +108,63 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping slots */
 static int next_free_slot = 0;
 
+/*
+ * Anonymous mapping placing (/dev/zero (deleted) below) looks like this:
+ *
+ * 00400000-00490000         /path/bin/postgres
+ * ...
+ * 012d9000-0133e000         [heap]
+ * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ * 7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
+ * 7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)
+ * ...
+ *
+ * We would like to place multiple mappings in such a way that there will be
+ * enough space between them in the address space to be able to resize up to a
+ * certain size, but without counting towards the total memory consumption.
+ *
+ * By letting Linux choose a mapping address, it will pick the lowest
+ * possible address that does not clash with any other mappings, which will be
+ * right before locales in the example above. This information (maximum allowed
+ * size of mappings and the lowest mapping address) is enough to place every
+ * mapping as follows:
+ *
+ * - Take the lowest mapping address, which we call later the probe address.
+ * - Subtract the offset of the previous mapping.
+ * - Subtract the maximum allowed size for the current mapping from the
+ *   address.
+ * - Place the mapping by the resulting address.
+ *
+ * The result would look like this:
+ *
+ * 012d9000-0133e000         [heap]
+ * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * [...free space...]
+ * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * [...free space...]
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ */
+Size SHMEM_EXTRA_SIZE_LIMIT[1] = {
+	0, 									/* MAIN_SHMEM_SLOT */
+};
+
+/* Remembers offset of the last mapping from the probe address */
+static Size last_offset = 0;
+
+/*
+ * Size of the mapping, which will be used to calculate anonymous mapping
+ * address. It should not be too small, otherwise there is a chance the probe
+ * mapping will be created between other mappings, leaving no room for extending
+ * it. But it should not be too large either, in case there are limitations
+ * on the mapping size. Current value is the default shared_buffers.
+ */
+#define PROBE_MAPPING_SIZE (Size) 128 * 1024 * 1024
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -673,13 +730,76 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
 	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
 	{
+		void *probe = NULL;
+
 		/*
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
 		allocsize = mapping->shmem_size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+
+		/*
+		 * Try to create the mapping at an address which will allow extending it
+		 * later:
+		 *
+		 * - First create the temporary probe mapping of a fixed size and let
+		 *   kernel to place it at address of its choice. By the virtue of the
+		 *   probe mapping size we expect it to be located at the lowest
+		 *   possible address, expecting some non mapped space above.
+		 *
+		 * - Unmap the probe mapping, remember the address.
+		 *
+		 * - Create an actual anonymous mapping at that address with the
+		 *   offset. The offset is calculated in such a way to allow growing
+		 *   the mapping within certain boundaries. For this mapping we use
+		 *   MAP_FIXED_NOREPLACE, which will error out with EEXIST if there is
+		 *   any mapping clash.
+		 *
+		 * - If the last step has failed, fallback to the regular mapping
+		 *   creation and signal that shared buffers could not be resized
+		 *   without a restart.
+		 */
+		probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
+
+		if (probe == MAP_FAILED)
+		{
+			mmap_errno = errno;
+			DebugMappings();
+			elog(DEBUG1, "slot[%s]: probe mmap(%zu) failed: %m",
+					MappingName(mapping->shmem_slot), allocsize);
+		}
+		else
+		{
+			Size offset = last_offset + SHMEM_EXTRA_SIZE_LIMIT[next_free_slot] + allocsize;
+			void *mapping_addr = (char *) probe - offset;
+
+			last_offset = offset;
+
+			munmap(probe, PROBE_MAPPING_SIZE);
+
+			ptr = mmap(mapping_addr, allocsize, PROT_READ | PROT_WRITE,
+					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+			mmap_errno = errno;
+			if (ptr == MAP_FAILED)
+			{
+				DebugMappings();
+				elog(DEBUG1, "slot[%s]: mmap(%zu) at address %p failed: %m",
+					 MappingName(mapping->shmem_slot), allocsize, mapping_addr);
+			}
+
+		}
+	}
+
+	if (ptr == MAP_FAILED)
+	{
+		/*
+		 * Fallback to the portable way of creating a mapping.
+		 */
+		allocsize = mapping->shmem_size;
+
+		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+						   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
 	}
 
-- 
2.34.1

0005-Use-anonymous-files-to-back-shared-memory-s-20250113.patch (text/x-patch)
From 746970c489f975b0d3add01b8d85d7cdab601b6d Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 15 Oct 2024 16:18:45 +0200
Subject: [PATCH 5/7] Use anonymous files to back shared memory segments

Allow using anonymous files for shared memory, instead of plain
anonymous memory. Such an anonymous file is created via memfd_create; it
lives in memory, behaves like a regular file, and is semantically equivalent
to anonymous memory allocated via mmap with MAP_ANONYMOUS.

The advantages of using anon files are the following:

* We've got a file descriptor, which could be used for regular file
  operations (modification, truncation, you name it).

* The file could be given a name, which improves readability when it
  comes to process maps. Here is how it looks:

7f5a2bd04000-7f5a32e52000 rw-s 00000000 00:01 1845 /memfd:strategy (deleted)
7f5a39252000-7f5a4030e000 rw-s 00000000 00:01 1842 /memfd:checkpoint (deleted)
7f5a4670e000-7f5a4d7ba000 rw-s 00000000 00:01 1839 /memfd:iocv (deleted)
7f5a53bba000-7f5a5ad26000 rw-s 00000000 00:01 1836 /memfd:descriptors (deleted)
7f5a9ad26000-7f5aa9d94000 rw-s 00000000 00:01 1833 /memfd:buffers (deleted)
7f5d29d94000-7f5d30e00000 rw-s 00000000 00:01 1830 /memfd:main (deleted)

* By default, Linux will not add file-backed shared mappings into a core dump,
  making it more convenient to work with them in PostgreSQL: no more huge dumps
  to process.

The downside is that memfd_create is Linux specific.
---
 src/backend/port/sysv_shmem.c | 64 ++++++++++++++++++++++++++++++-----
 src/include/portability/mem.h |  2 +-
 2 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 72e823618ef..b2173e1a078 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -103,6 +103,7 @@ typedef struct AnonymousMapping
 	void *shmem; 				/* Pointer to the start of the mapped memory */
 	void *seg_addr; 			/* SysV shared memory for the header */
 	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -116,7 +117,7 @@ static int next_free_slot = 0;
  * 00400000-00490000         /path/bin/postgres
  * ...
  * 012d9000-0133e000         [heap]
- * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f443a800000-7f470a800000 /memfd:main (deleted)
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
@@ -143,9 +144,9 @@ static int next_free_slot = 0;
  * The result would look like this:
  *
  * 012d9000-0133e000         [heap]
- * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f4426f54000-7f442e010000 /memfd:main (deleted)
  * [...free space...]
- * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f443a800000-7f444196c000 /memfd:buffers (deleted)
  * [...free space...]
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
@@ -708,6 +709,18 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
+	/*
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped,  it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, thus it has
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
+	 */
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_slot), 0);
+
 #ifndef MAP_HUGETLB
 	/* PGSharedMemoryCreate should have dealt with this case */
 	Assert(huge_pages != HUGE_PAGES_ON);
@@ -725,8 +738,13 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
+		/*
+		 * Do not use an anonymous file here yet. When adding it, do not forget
+		 * to use ftruncate and flags MFD_HUGETLB & MFD_HUGE_2MB/MFD_HUGE_1GB
+		 * in memfd_create.
+		 */
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
 		{
@@ -762,7 +780,8 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 * - First create the temporary probe mapping of a fixed size and let
 		 *   kernel to place it at address of its choice. By the virtue of the
 		 *   probe mapping size we expect it to be located at the lowest
-		 *   possible address, expecting some non mapped space above.
+		 *   possible address, expecting some non mapped space above. The probe
+		 *   does not need to be backed by an anonymous file.
 		 *
 		 * - Unmap the probe mapping, remember the address.
 		 *
@@ -777,7 +796,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 *   without a restart.
 		 */
 		probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS, -1, 0);
 
 		if (probe == MAP_FAILED)
 		{
@@ -795,8 +814,20 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
 			munmap(probe, PROBE_MAPPING_SIZE);
 
+			/*
+			 * Specify the segment file size using allocsize, which contains
+			 * potentially modified size.
+			 */
+			if (ftruncate(mapping->segment_fd, allocsize) < 0)
+			{
+				DebugMappings();
+				elog(DEBUG1, "slot[%s]: ftruncate(%zu) failed: %m",
+					 MappingName(mapping->shmem_slot), allocsize);
+
+			}
+
 			ptr = mmap(mapping_addr, allocsize, PROT_READ | PROT_WRITE,
-					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, mapping->segment_fd, 0);
 			mmap_errno = errno;
 			if (ptr == MAP_FAILED)
 			{
@@ -815,8 +846,17 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 */
 		allocsize = mapping->shmem_size;
 
+		/* Specify the segment file size using allocsize. */
+		if (ftruncate(mapping->segment_fd, allocsize) < 0)
+		{
+			DebugMappings();
+			elog(DEBUG1, "slot[%s]: ftruncate(%zu) failed: %m",
+				 MappingName(mapping->shmem_slot), allocsize);
+
+		}
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-						   PG_MMAP_FLAGS, -1, 0);
+						   PG_MMAP_FLAGS, mapping->segment_fd, 0);
 		mmap_errno = errno;
 	}
 
@@ -905,6 +945,14 @@ AnonymousShmemResize(int newval, void *extra)
 		if (m->shmem_size == new_size)
 			continue;
 
+		/* Resize the backing anon file. */
+		if (ftruncate(m->segment_fd, new_size) < 0)
+		{
+			DebugMappings();
+			elog(DEBUG1, "slot[%s]: ftruncate(%zu) failed: %m",
+				 MappingName(m->shmem_slot), new_size);
+		}
+
 		if (mremap(m->shmem, m->shmem_size, new_size, 0) < 0)
 			elog(LOG, "mremap(%p, %zu) failed: %m",
 				 m->shmem, m->shmem_size);
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index 2cd05313b82..50db0da28dc 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
-- 
2.34.1

0001-Allow-to-use-multiple-shared-memory-mapping-20250113.patch (text/x-patch)
From e7b3e3690a9b9575394cffbde2f7c5e674d28ed9 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 9 Oct 2024 15:41:32 +0200
Subject: [PATCH 1/7] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways the shared memory can be organized.

Introduce the possibility of allocating multiple shared memory mappings, where
a single mapping is associated with a specified shared memory slot.
There is only a fixed number of available slots; currently only one main
shared memory slot is allocated. A new shared memory API is introduced,
extended with a slot as a new parameter. As a path of least resistance,
the original API is kept in place, utilizing the main shared memory slot.
---
 src/backend/port/posix_sema.c       |   4 +-
 src/backend/port/sysv_sema.c        |   4 +-
 src/backend/port/sysv_shmem.c       | 138 +++++++++++++++++++---------
 src/backend/port/win32_sema.c       |   2 +-
 src/backend/storage/ipc/ipc.c       |   2 +-
 src/backend/storage/ipc/ipci.c      |  61 ++++++------
 src/backend/storage/ipc/shmem.c     | 135 ++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c   |   5 +-
 src/include/storage/buf_internals.h |   1 +
 src/include/storage/ipc.h           |   2 +-
 src/include/storage/pg_sema.h       |   2 +-
 src/include/storage/pg_shmem.h      |  18 ++++
 src/include/storage/shmem.h         |  10 ++
 13 files changed, 261 insertions(+), 123 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 64186ec0a7e..b97723d2ede 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_slot)
 {
 	struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSlot(PGSemaphoreShmemSize(maxSemas), shmem_slot);
 #endif
 
 	numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 68835723b90..e6720a6a077 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -313,7 +313,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_slot)
 {
 	struct stat statbuf;
 
@@ -334,7 +334,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSlot(PGSemaphoreShmemSize(maxSemas), shmem_slot);
 	numSharedSemas = 0;
 	maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index a5a4511f66d..475c9c8f1a1 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+	int shmem_slot;
+	Size shmem_size; 			/* Size of the mapping */
+	void *shmem; 				/* Pointer to the start of the mapped memory */
+	void *seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping slots */
+static int next_free_slot = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_slot)
+{
+	switch (shmem_slot)
+	{
+		case MAIN_SHMEM_SLOT:
+			return "main";
+		default:
+			return "unknown";
+	}
+}
+
+static void
+DebugMappings()
+{
+	for(int i = 0; i < next_free_slot; i++)
+	{
+		AnonymousMapping m = Mappings[i];
+		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+			 MappingName(i), m.shmem, m.shmem_size);
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-	Size		allocsize = *size;
+	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
 				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		{
+			DebugMappings();
+			elog(DEBUG1, "slot[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
+				 MappingName(mapping->shmem_slot), allocsize);
+		}
 	}
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = *size;
+		allocsize = mapping->shmem_size;
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
 	if (ptr == MAP_FAILED)
 	{
 		errno = mmap_errno;
+		DebugMappings();
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
+				(errmsg("slot[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_slot)),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
-	*size = allocsize;
-	return ptr;
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < next_free_slot; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		AnonymousMapping m = Mappings[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	AnonymousMapping *mapping = &Mappings[next_free_slot];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	mapping->shmem_size = size;
+	mapping->shmem_slot = next_free_slot;
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping);
+
+		next_free_slot++;
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + next_free_slot;
 
 	for (;;)
 	{
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	mapping->seg_addr = memAddress;
+	mapping->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (mapping->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < next_free_slot; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		AnonymousMapping m = Mappings[i];
+
+		if (m.seg_addr != NULL)
+		{
+			if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+			m.seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index f2b54bdfda0..d62084cc0d9 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_slot)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index b06e4b84528..2aabd4a77f3 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -68,7 +68,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 7783ba854fc..c0e1d94d1f7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -85,7 +85,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_slot)
 {
 	Size		size;
 	int			numSemas;
@@ -204,33 +204,36 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize(&numSemas);
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
-
-	/*
-	 * Create semaphores
-	 */
-	PGReserveSemaphores(numSemas);
-
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int slot = 0; slot < ANON_MAPPINGS; slot++)
+	{
+		/* Compute the size of the shared-memory block */
+		size = CalculateShmemSize(&numSemas, slot);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+		/*
+		 * Create the shmem segment
+		 */
+		seghdr = PGSharedMemoryCreate(size, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSlot(seghdr, slot);
+
+		/*
+		 * Create semaphores
+		 */
+		PGReserveSemaphores(numSemas, slot);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSlot(slot);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -360,7 +363,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas);
+	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SLOT);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 6d3074594a6..89d8c7baf16 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -75,17 +75,12 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSlot(Size size, Size *allocated_size,
+								 int shmem_slot);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
-
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
+ShmemSegment Segments[ANON_MAPPINGS];
 
 static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 
@@ -96,9 +91,17 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-	ShmemSegHdr = seghdr;
-	ShmemBase = seghdr;
-	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+	InitShmemAccessInSlot(seghdr, MAIN_SHMEM_SLOT);
+}
+
+void
+InitShmemAccessInSlot(PGShmemHeader *seghdr, int shmem_slot)
+{
+	ShmemSegment *seg = &Segments[shmem_slot];
+
+	seg->ShmemSegHdr = seghdr;
+	seg->ShmemBase = (void *) seghdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + seghdr->totalsize;
 }
 
 /*
@@ -109,7 +112,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSlot(MAIN_SHMEM_SLOT);
+}
+
+void
+InitShmemAllocationInSlot(int shmem_slot)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_slot].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -118,9 +127,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_slot].ShmemLock = (slock_t *) ShmemAllocUnlockedInSlot(sizeof(slock_t), shmem_slot);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_slot].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -145,11 +154,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSlot(size, MAIN_SHMEM_SLOT);
+}
+
+void *
+ShmemAllocInSlot(Size size, int shmem_slot)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSlot(size, &allocated_size, shmem_slot);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -179,10 +194,17 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSlot(size, allocated_size, MAIN_SHMEM_SLOT);
+}
+
+static void *
+ShmemAllocRawInSlot(Size size, Size *allocated_size, int shmem_slot)
 {
 	Size		newStart;
 	Size		newFree;
 	void	   *newSpace;
+	ShmemSegment *seg = &Segments[shmem_slot];
 
 	/*
 	 * Ensure all space is adequately aligned.  We used to only MAXALIGN this
@@ -198,22 +220,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(seg->ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(seg->ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = seg->ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= seg->ShmemSegHdr->totalsize)
 	{
-		newSpace = (char *) ShmemBase + newStart;
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (char *) seg->ShmemBase + newStart;
+		seg->ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(seg->ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -231,29 +253,36 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+	return ShmemAllocUnlockedInSlot(size, MAIN_SHMEM_SLOT);
+}
+
+void *
+ShmemAllocUnlockedInSlot(Size size, int shmem_slot)
 {
 	Size		newStart;
 	Size		newFree;
 	void	   *newSpace;
+	ShmemSegment *seg = &Segments[shmem_slot];
 
 	/*
 	 * Ensure allocated space is adequately aligned.
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(seg->ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = seg->ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > seg->ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory (%zu bytes requested)",
 						size)));
-	ShmemSegHdr->freeoffset = newFree;
+	seg->ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (char *) ShmemBase + newStart;
+	newSpace = (char *) seg->ShmemBase + newStart;
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -268,7 +297,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSlot(addr, MAIN_SHMEM_SLOT);
+}
+
+bool
+ShmemAddrIsValidInSlot(const void *addr, int shmem_slot)
+{
+	return (addr >= Segments[shmem_slot].ShmemBase) && (addr < Segments[shmem_slot].ShmemEnd);
 }
 
 /*
@@ -329,6 +364,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSlot(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SLOT);
+}
+
+HTAB *
+ShmemInitHashInSlot(const char *name,		/* table string name for shmem index */
+			  long init_size,	/* initial table size */
+			  long max_size,	/* max size of the table */
+			  HASHCTL *infoP,	/* info about key and bucket size */
+			  int hash_flags,	/* info about infoP */
+			  int shmem_slot) 	/* in which slot to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -345,9 +392,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSlot(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_slot);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -380,6 +427,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSlot(name, size, foundPtr, MAIN_SHMEM_SLOT);
+}
+
+void *
+ShmemInitStructInSlot(const char *name, Size size, bool *foundPtr,
+					  int shmem_slot)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -388,7 +442,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_slot].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -411,7 +465,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSlot(size, shmem_slot);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -428,8 +482,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in slot %d",
+						name, shmem_slot)));
 	}
 
 	if (*foundPtr)
@@ -454,7 +508,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSlot(size, &allocated_size, shmem_slot);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -473,14 +527,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSlot(structPtr, shmem_slot));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -540,7 +593,7 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SLOT].ShmemSegHdr);
 		values[2] = Int64GetDatum(ent->size);
 		values[3] = Int64GetDatum(ent->allocated_size);
 		named_allocated += ent->allocated_size;
@@ -552,15 +605,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	/* output shared memory allocated but not counted via the shmem index */
 	values[0] = CStringGetTextDatum("<anonymous>");
 	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->freeoffset - named_allocated);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 	/* output as-of-yet unused shared memory */
 	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->freeoffset);
 	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->freeoffset);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 9cf3e4f4f3a..cd3237b3736 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -81,6 +81,7 @@
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/spin.h"
@@ -607,9 +608,9 @@ LWLockNewTrancheId(void)
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
 	/* We use the ShmemLock spinlock to protect LWLockCounter */
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[MAIN_SHMEM_SLOT].ShmemLock);
 	result = (*LWLockCounter)++;
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SLOT].ShmemLock);
 
 	return result;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index eda6c699212..b25dc0199b8 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -22,6 +22,7 @@
 #include "storage/condition_variable.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
+#include "storage/pg_shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index b2d062781ec..be4b1312888 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_slot);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index dfef79ac963..081fffaf165 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_slot);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 3065ff5be71..e968deeef7f 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+} ShmemSegment;
+
+// Number of available slots for anonymous memory mappings
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -90,4 +105,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main slot, contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SLOT 0
+
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 8cdbe7a89c8..4261b4039b9 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -29,15 +29,25 @@
 extern PGDLLIMPORT slock_t *ShmemLock;
 struct PGShmemHeader;			/* avoid including storage/pg_shmem.h here */
 extern void InitShmemAccess(struct PGShmemHeader *seghdr);
+extern void InitShmemAccessInSlot(struct PGShmemHeader *seghdr, int shmem_slot);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSlot(int shmem_slot);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSlot(Size size, int shmem_slot);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSlot(Size size, int shmem_slot);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSlot(const void *addr, int shmem_slot);
 extern void InitShmemIndex(void);
+extern void InitVariableShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSlot(const char *name, long init_size, long max_size,
+						   HASHCTL *infoP, int hash_flags, int shmem_slot);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSlot(const char *name, Size size, bool *foundPtr,
+								   int shmem_slot);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 
-- 
2.34.1

0004-Allow-to-resize-shared-memory-without-resta-20250113.patch (text/x-patch)
From ba93989e755b23f3cef21ad0d11a14231e866b07 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:24:58 +0200
Subject: [PATCH 4/7] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using the
space introduced in the previous commits, without requiring a PostgreSQL
restart. The size of every shared memory slot is recalculated based on the
new NBuffers, and the mapping is extended using mremap. After allocating
new space, new shared structures (buffer blocks, descriptors, etc.) are
allocated as needed. Here is how it looks after raising shared_buffers
from 128 MB to 512 MB and calling pg_reload_conf():

    -- 128 MB
    7f5a2bd04000-7f5a32e52000  /dev/zero (deleted)
    7f5a39252000-7f5a4030e000  /dev/zero (deleted)
    7f5a4670e000-7f5a4d7ba000  /dev/zero (deleted)
    7f5a53bba000-7f5a5ad26000  /dev/zero (deleted)
    7f5a9ad26000-7f5aa9d94000  /dev/zero (deleted)
    ^ buffers mapping, ~240 MB
    7f5d29d94000-7f5d30e00000  /dev/zero (deleted)

    -- 512 MB
    7f5a2bd04000-7f5a33274000  /dev/zero (deleted)
    7f5a39252000-7f5a4057e000  /dev/zero (deleted)
    7f5a4670e000-7f5a4d9fa000  /dev/zero (deleted)
    7f5a53bba000-7f5a5b1a6000  /dev/zero (deleted)
    7f5a9ad26000-7f5ac1f14000  /dev/zero (deleted)
    ^ buffers mapping, ~625 MB
    7f5d29d94000-7f5d30f80000  /dev/zero (deleted)

The implementation supports only increasing shared_buffers. Decreasing the
value needs a similar procedure, but the buffer blocks holding data have
to be drained first, so that the actual data set fits into the new,
smaller space.

Experiments show that shared mappings have to be extended separately in
each process that uses them. Another rough edge is that a backend
executing pg_reload_conf interactively will not resize its mappings
immediately; for some reason it requires another command.

Note that mremap is Linux specific, so the implementation is not very
portable.
---
 src/backend/port/sysv_shmem.c                 | 62 +++++++++++++
 src/backend/storage/buffer/buf_init.c         | 86 +++++++++++++++++++
 src/backend/storage/ipc/ipci.c                | 11 +++
 src/backend/storage/ipc/shmem.c               | 14 ++-
 .../utils/activity/wait_event_names.txt       |  1 +
 src/backend/utils/misc/guc_tables.c           |  4 +-
 src/include/storage/bufmgr.h                  |  1 +
 src/include/storage/lwlocklist.h              |  1 +
 src/include/storage/pg_shmem.h                |  2 +
 9 files changed, 171 insertions(+), 11 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 7157bf95b1a..72e823618ef 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,9 +30,11 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
@@ -861,6 +863,66 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * An assign callback for shared_buffers GUC -- a somewhat clumsy way of
+ * resizing shared memory without a restart. On NBuffers change use the new
+ * value to recalculate the required size for every shmem slot, then based on
+ * the new and old values initialize the new buffer blocks.
+ *
+ * The actual slot resizing is done via mremap, which will fail if there is not
+ * sufficient space to expand the mapping.
+ *
+ * XXX: For some reason in the current implementation the change is applied to
+ * the backend calling pg_reload_conf only at the backend exit.
+ */
+void
+AnonymousShmemResize(int newval, void *extra)
+{
+	int	numSemas;
+	bool reinit = false;
+	int NBuffersOld = NBuffers;
+
+	/*
+	 * XXX: Currently only increasing of shared_buffers is supported. For
+	 * decreasing something similar has to be done, but buffer blocks with
+	 * data have to be drained first.
+	 */
+	if(NBuffers > newval)
+		return;
+
+	/* XXX: Hack, NBuffers has to be exposed in the interface for memory
+	 * calculation and buffer blocks reinitialization instead. */
+	NBuffers = newval;
+
+	for(int i = 0; i < next_free_slot; i++)
+	{
+		Size new_size = CalculateShmemSize(&numSemas, i);
+		AnonymousMapping *m = &Mappings[i];
+
+		if (m->shmem == NULL)
+			continue;
+
+		if (m->shmem_size == new_size)
+			continue;
+
+		if (mremap(m->shmem, m->shmem_size, new_size, 0) < 0)
+			elog(LOG, "mremap(%p, %zu) failed: %m",
+				 m->shmem, m->shmem_size);
+		else
+		{
+			reinit = true;
+			m->shmem_size = new_size;
+		}
+	}
+
+	if (reinit)
+	{
+		LWLockAcquire(ShmemResizeLock, LW_EXCLUSIVE);
+		BufferManagerShmemResize(NBuffersOld);
+		LWLockRelease(ShmemResizeLock);
+	}
+}
+
 /*
  * PGSharedMemoryCreate
  *
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index b066e97a0c9..ae58f82937f 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -153,6 +153,92 @@ BufferManagerShmemInit(void)
 						 &backend_flush_after);
 }
 
+/*
+ * Reinitialize shared memory structures, which size depends on NBuffers. It's
+ * similar to BufferManagerShmemInit, but applied only to the buffers in the range
+ * between NBuffersOld and NBuffers.
+ */
+void
+BufferManagerShmemResize(int NBuffersOld)
+{
+	bool		foundBufs,
+				foundDescs,
+				foundIOCV,
+				foundBufCkpt;
+	int			i;
+
+	/* XXX: Only increasing of shared_buffers is supported in this function */
+	if(NBuffersOld > NBuffers)
+		return;
+
+	/* Align descriptors to a cacheline boundary. */
+	BufferDescriptors = (BufferDescPadded *)
+		ShmemInitStructInSlot("Buffer Descriptors",
+						NBuffers * sizeof(BufferDescPadded),
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SLOT);
+
+	/* Align condition variables to cacheline boundary. */
+	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
+		ShmemInitStructInSlot("Buffer IO Condition Variables",
+						NBuffers * sizeof(ConditionVariableMinimallyPadded),
+						&foundIOCV, BUFFER_IOCV_SHMEM_SLOT);
+
+	/*
+	 * The array used to sort to-be-checkpointed buffer ids is located in
+	 * shared memory, to avoid having to allocate significant amounts of
+	 * memory at runtime. As that'd be in the middle of a checkpoint, or when
+	 * the checkpointer is restarted, memory allocation failures would be
+	 * painful.
+	 */
+	CkptBufferIds = (CkptSortItem *)
+		ShmemInitStructInSlot("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SLOT);
+
+	/* Align buffer pool on IO page size boundary. */
+	BufferBlocks = (char *)
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemInitStructInSlot("Buffer Blocks",
+								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &foundBufs, BUFFERS_SHMEM_SLOT));
+
+	/*
+	 * Initialize the headers for new buffers.
+	 */
+	for (i = NBuffersOld - 1; i < NBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
+
+		ClearBufferTag(&buf->tag);
+
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+
+		buf->buf_id = i;
+
+		/*
+		 * Initially link all the buffers together as unused. Subsequent
+		 * management of this list is done by freelist.c.
+		 */
+		buf->freeNext = i + 1;
+
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
+
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
+	}
+
+	/* Correct last entry of linked list */
+	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+
+	/* Init other shared buffer-management stuff */
+	StrategyInitialize(!foundDescs);
+
+	/* Initialize per-backend file flush context */
+	WritebackContextInit(&BackendWritebackContext,
+						 &backend_flush_after);
+}
+
 /*
  * BufferManagerShmemSize
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index fd8b44b8161..15d06fd4ca4 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -83,6 +83,9 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ *
+ * XXX: Calculations for non-main shared memory slots are incorrect; they
+ * include more than is needed for buffers only.
  */
 Size
 CalculateShmemSize(int *num_semaphores, int shmem_slot)
@@ -149,6 +152,14 @@ CalculateShmemSize(int *num_semaphores, int shmem_slot)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how slots are split, or there was a hidden
+	 * inconsistency in shmem calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 89d8c7baf16..faca7c9a525 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -490,17 +490,13 @@ ShmemInitStructInSlot(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. Verify the structure's size:
+		 * - If it's the same, we've found the expected structure.
+		 * - If it's different, we're resizing the expected structure.
 		 */
 		if (result->size != size)
-		{
-			LWLockRelease(ShmemIndexLock);
-			ereport(ERROR,
-					(errmsg("ShmemIndex entry size is wrong for data structure"
-							" \"%s\": expected %zu, actual %zu",
-							name, size, result->size)));
-		}
+			result->size = size;
+
 		structPtr = result->location;
 	}
 	else
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 16144c2b72d..e8ecff5f7f0 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -345,6 +345,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8cf1afbad20..7a12eedbbd3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2318,14 +2318,14 @@ struct config_int ConfigureNamesInt[] =
 	 * checking for overflow, so we mustn't allow more than INT_MAX / 2.
 	 */
 	{
-		{"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+		{"shared_buffers", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the number of shared memory buffers used by the server."),
 			NULL,
 			GUC_UNIT_BLOCKS
 		},
 		&NBuffers,
 		16384, 16, INT_MAX / 2,
-		NULL, NULL, NULL
+		NULL, AnonymousShmemResize, NULL
 	},
 
 	{
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 27c4cac8540..ead69a2974c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -302,6 +302,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
 extern Size BufferManagerShmemSize(int);
+extern void BufferManagerShmemResize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54fb..e8d379e4b0b 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ShmemResize)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index c0143e38995..c1a96240d79 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -105,6 +105,8 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+extern void AnonymousShmemResize(int newval, void *extra);
+
 /*
  * To be able to dynamically resize largest parts of the data stored in shared
  * memory, we split it into multiple shared memory mappings slots. Each slot
-- 
2.34.1
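
To make the mremap mechanism used by AnonymousShmemResize() in the patch above
a bit more tangible, here is a minimal standalone sketch (not part of the patch
set) of growing an anonymous mapping in place. The hint address and sizes are
arbitrary illustration values; as in the patch, MREMAP_MAYMOVE is not passed,
so the call succeeds only if the address range right after the mapping is still
unmapped, which is exactly what the reserved gaps between shmem slots are for.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
	size_t	old_size = 16 * 1024 * 1024;
	size_t	new_size = 64 * 1024 * 1024;
	/* Hint far away from other mappings so there is room to grow upwards. */
	void   *hint = (void *) 0x400000000000;
	void   *p;

	p = mmap(hint, old_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	/* Grow in place; fails with ENOMEM if the adjacent range is occupied. */
	if (mremap(p, old_size, new_size, 0) == MAP_FAILED)
	{
		perror("mremap");
		return 1;
	}

	printf("mapping at %p grown from %zu to %zu bytes\n",
		   p, old_size, new_size);
	return 0;
}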

0003-Introduce-multiple-shmem-slots-for-shared-b-20250113.patch (text/x-patch)
From 4217a9e2b1922d22cc54b89bcdd6af9159fa9398 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:24:04 +0200
Subject: [PATCH 3/7] Introduce multiple shmem slots for shared buffers

Add more shmem slots to split shared buffers into the following chunks:
* BUFFERS_SHMEM_SLOT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SLOT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SLOT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SLOT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SLOT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers,
meaning that if we want to change NBuffers, they have to be resized
correspondingly. Placing each of them in a separate shmem slot makes that
possible.

There are some assumptions made about each shmem slot's upper size limit.
The buffer blocks have the largest, while the rest claim less extra room
for resizing. Ideally those limits should be deduced from the maximum
allowed shared memory.
---
 src/backend/port/sysv_shmem.c          | 17 +++++-
 src/backend/storage/buffer/buf_init.c  | 77 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  5 +-
 src/backend/storage/buffer/freelist.c  |  4 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 23 +++++++-
 7 files changed, 96 insertions(+), 34 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index bae8f19a755..7157bf95b1a 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -149,8 +149,13 @@ static int next_free_slot = 0;
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
  */
-Size SHMEM_EXTRA_SIZE_LIMIT[1] = {
+Size SHMEM_EXTRA_SIZE_LIMIT[6] = {
 	0, 									/* MAIN_SHMEM_SLOT */
+	(Size) 1024 * 1024 * 1024 * 10, 	/* BUFFERS_SHMEM_SLOT */
+	(Size) 1024 * 1024 * 1024 * 1, 		/* BUFFER_DESCRIPTORS_SHMEM_SLOT */
+	(Size) 1024 * 1024 * 100, 			/* BUFFER_IOCV_SHMEM_SLOT */
+	(Size) 1024 * 1024 * 100, 			/* CHECKPOINT_BUFFERS_SHMEM_SLOT */
+	(Size) 1024 * 1024 * 100, 			/* STRATEGY_SHMEM_SLOT */
 };
 
 /* Remembers offset of the last mapping from the probe address */
@@ -179,6 +184,16 @@ MappingName(int shmem_slot)
 	{
 		case MAIN_SHMEM_SLOT:
 			return "main";
+		case BUFFERS_SHMEM_SLOT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SLOT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SLOT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SLOT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SLOT:
+			return "strategy";
 		default:
 			return "unknown";
 	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 56761a8eedc..b066e97a0c9 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -61,7 +61,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). Size of data structures initialized
+ * here depends on NBuffers, and to be able to change NBuffers without a
+ * restart we store each structure into a separate shared memory slot, which
+ * could be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -73,22 +76,22 @@ BufferManagerShmemInit(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSlot("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SLOT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSlot("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SLOT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSlot("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SLOT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -98,8 +101,9 @@ BufferManagerShmemInit(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSlot("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SLOT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -153,33 +157,54 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc. based on the
+ * shared memory slot. The main slot must not allocate anything
+ * related to buffers; every other slot will receive part of the
+ * data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(int shmem_slot)
 {
 	Size		size = 0;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_slot == MAIN_SHMEM_SLOT)
+		return size;
+ 
+	if (shmem_slot == BUFFER_DESCRIPTORS_SHMEM_SLOT)
+	{
+		/* size of buffer descriptors */
+		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+		/* to allow aligning buffer descriptors */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	if (shmem_slot == BUFFERS_SHMEM_SLOT)
+	{
+		/* size of data pages, plus alignment padding */
+		size = add_size(size, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	}
 
-	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	if (shmem_slot == STRATEGY_SHMEM_SLOT)
+	{
+		/* size of stuff controlled by freelist.c */
+		size = add_size(size, StrategyShmemSize());
+	}
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
+	if (shmem_slot == BUFFER_IOCV_SHMEM_SLOT)
+	{
+		/* size of I/O condition variables */
+		size = add_size(size, mul_size(NBuffers,
 								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+		/* to allow aligning the above */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	if (shmem_slot == CHECKPOINT_BUFFERS_SHMEM_SLOT)
+	{
+		/* size of checkpoint sort array in bufmgr.c */
+		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	}
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 141dd724802..ff761574aa4 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -59,10 +59,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSlot("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION,
+								  STRATEGY_SHMEM_SLOT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index dffdd57e9b5..325606dae71 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -491,9 +491,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSlot("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SLOT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index c0e1d94d1f7..fd8b44b8161 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -112,7 +112,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_slot)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize());
+	size = add_size(size, BufferManagerShmemSize(shmem_slot));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..27c4cac8540 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -301,7 +301,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index e968deeef7f..c0143e38995 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 // Number of available slots for anonymous memory mappings
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -105,7 +105,28 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/*
+ * To be able to dynamically resize largest parts of the data stored in shared
+ * memory, we split it into multiple shared memory mappings slots. Each slot
+ * contains only certain part of the data, which size depends on NBuffers.
+ */
+
 /* The main slot, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SLOT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SLOT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SLOT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SLOT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SLOT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SLOT 5
+
 #endif							/* PG_SHMEM_H */
-- 
2.34.1

0006-Add-TODOs-and-questions-about-previous-comm-20250113.patch (text/x-patch)
From f33d7888253650c9f10634c8c28ea10c2e3d0fd8 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 6 Jan 2025 14:40:51 +0530
Subject: [PATCH 6/7] Add TODOs and questions about previous commits

The commit just marks the places which need more work or where the code
raises some questions. This is not an exhaustive list of TODOs; more may
come up as I work with these patches further.

Ashutosh Bapat
---
 src/backend/storage/buffer/buf_init.c         |  9 +++++---
 src/backend/storage/ipc/ipc.c                 |  4 ++++
 src/backend/storage/ipc/ipci.c                | 11 +++++++++-
 src/backend/storage/ipc/shmem.c               | 22 ++++++++++++++++---
 src/backend/storage/lmgr/lwlock.c             |  5 +++++
 src/backend/tcop/postgres.c                   |  6 +++++
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/storage/buf_internals.h           |  5 +++++
 src/include/storage/pg_shmem.h                | 18 ++++++++++-----
 9 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ae58f82937f..bf74b4b01a7 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -155,8 +155,11 @@ BufferManagerShmemInit(void)
 
 /*
  * Reinitialize shared memory structures, which size depends on NBuffers. It's
- * similar to BufferManagerShmemInit, but applied only to the buffers in the range
- * between NBuffersOld and NBuffers.
+ * similar to BufferManagerShmemInit, but applied only to the buffers in the
+ * range between NBuffersOld and NBuffers.
+ *
+ * TODO: Avoid code duplication with BufferManagerShmemInit() and also assess
+ * which functionality in the latter is required in this function.
  */
 void
 BufferManagerShmemResize(int NBuffersOld)
@@ -255,7 +258,7 @@ BufferManagerShmemSize(int shmem_slot)
 
 	if (shmem_slot == MAIN_SHMEM_SLOT)
 		return size;
- 
+
 	if (shmem_slot == BUFFER_DESCRIPTORS_SHMEM_SLOT)
 	{
 		/* size of buffer descriptors */
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 2aabd4a77f3..556eb469f4f 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -68,6 +68,10 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
+/*
+ * TODO: Why do we need to increase this by 20? I didn't notice any new calls to
+ * on_shmem_exit or on_proc_exit or before_shmem_exit.
+ */
 #define MAX_ON_EXITS 40
 
 struct ONEXIT
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 15d06fd4ca4..e076f96ebf2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -204,6 +204,12 @@ AttachSharedMemoryStructs(void)
 /*
  * CreateSharedMemoryAndSemaphores
  *		Creates and initializes shared memory and semaphores.
+ *
+ * TODO: IMO this function should be rewritten to calculate the size of each
+ * shared memory slot or mapping. Instead of passing slot number to
+ * CalculateShmemSize, we should instead let each shared memory module use their
+ * own slot number and update the required sizes in the corresponding mapping.
+ * Then allocate shared memory in each of the mappings.
  */
 void
 CreateSharedMemoryAndSemaphores(void)
@@ -222,7 +228,10 @@ CreateSharedMemoryAndSemaphores(void)
 		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
 
 		/*
-		 * Create the shmem segment
+		 * Create the shmem segment.
+		 * TODO: while each slot will return a different shim, only the last one
+		 * is passed to dsm_postmaster_startup(). Is that right? Shouldn't we
+		 * pass all of them or none?
 		 */
 		seghdr = PGSharedMemoryCreate(size, &shim);
 
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index faca7c9a525..c1dde378329 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -82,6 +82,7 @@ static void *ShmemAllocRawInSlot(Size size, Size *allocated_size,
 
 ShmemSegment Segments[ANON_MAPPINGS];
 
+/*TODO: shouldn't this be part of the ShmemSegment structure? */
 static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 
 
@@ -490,9 +491,18 @@ ShmemInitStructInSlot(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already. Verify the structure's size:
-		 * - If it's the same, we've found the expected structure.
-		 * - If it's different, we're resizing the expected structure.
+		 * already. Verify the structure's size: - If it's the same, we've found
+		 * the expected structure.  - If it's different, we're resizing the
+		 * expected structure.
+		 *
+		 * TODO: This works because every structure that needs to be resized
+		 * resides in a shmem slot by itself. But it won't work if a slot
+		 * contains more structures that need to be resized, placed in adjacent
+		 * memory. Also we are not updating the Shmem stats like freeoffset. I
+		 * think we will keep all resizable structures in a slot for themselves,
+		 * and not have a hash table in such slots since resizing the hash table
+		 * itself might cause memory to be allocated next to the resizable
+		 * structure making it difficult to resize it.
 		 */
 		if (result->size != size)
 			result->size = size;
@@ -584,6 +594,12 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	hash_seq_init(&hstat, ShmemIndex);
 
+	/*
+	 * TODO: For the sake of completeness we should rotate through all the slots
+	 * (after saving slotwise ShmemIndex, if any). Do we also want to output the
+	 * shmem slot name? That would expose the slotified structure of shared
+	 * memory, though.
+	 */
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cd3237b3736..0be59074709 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -608,6 +608,11 @@ LWLockNewTrancheId(void)
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
 	/* We use the ShmemLock spinlock to protect LWLockCounter */
+	/*
+	 * TODO: We have retained the ShmemLock global variable; should we use it
+	 * here instead of the main segment's lock? If yes, we will need to
+	 * initialize the spinlock on the global one.
+	 */
 	SpinLockAcquire(Segments[MAIN_SHMEM_SLOT].ShmemLock);
 	result = (*LWLockCounter)++;
 	SpinLockRelease(Segments[MAIN_SHMEM_SLOT].ShmemLock);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 85902788181..8c89928203b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4656,6 +4656,12 @@ PostgresMain(const char *dbname, const char *username)
 		/*
 		 * (6) check for any other interesting events that happened while we
 		 * slept.
+		 * TODO: When a backend is waiting for a command, it won't reload
+		 * configuration and hence wouldn't notice change in shared_buffers. The
+		 * change is only noticed after the command is received and the control
+		 * comes here. We may need to improve this in case we want to resize
+		 * shared buffers or perform part of that operation in assign_hook
+		 * implementation (e.g. AnonymousShmemResize()).
 		 */
 		if (ConfigReloadPending)
 		{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e8ecff5f7f0..acd94c3616c 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -345,6 +345,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+# TODO, not used anywhere, do we need it?
 ShmemResize	"Waiting to resize shared memory."
 
 #
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b25dc0199b8..6a352d3942e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -22,6 +22,11 @@
 #include "storage/condition_variable.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
+/*
+ * TODO: this header file doesn't use anything in pg_shmem.h but the files which
+ * include this file may. We should include pg_shmem.h in those files rather than
+ * here.
+ */
 #include "storage/pg_shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index c1a96240d79..39521208fb9 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -42,6 +42,10 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+/*
+ * TODO: should we define it in shmem.c where the previous global variables were
+ * declared? Do we need this structure outside shmem.c?
+ */
 typedef struct ShmemSegment
 {
 	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
@@ -51,11 +55,6 @@ typedef struct ShmemSegment
 									 * allocation */
 } ShmemSegment;
 
-// Number of available slots for anonymous memory mappings
-#define ANON_MAPPINGS 6
-
-extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
-
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -111,6 +110,10 @@ extern void AnonymousShmemResize(int newval, void *extra);
  * To be able to dynamically resize largest parts of the data stored in shared
  * memory, we split it into multiple shared memory mappings slots. Each slot
  * contains only certain part of the data, which size depends on NBuffers.
+ *
+ * TODO: convert this into an enum with a sentinel symbol ANON_MAPPINGS, which
+ * itself should be renamed to NUM_ANON_MAPPINGS or NUM_SHMEM_SEGMENTS or
+ * something that indicates that it's the number of shared memory segments.
  */
 
 /* The main slot, contains everything except buffer blocks and related data. */
@@ -131,4 +134,9 @@ extern void AnonymousShmemResize(int newval, void *extra);
 /* Buffer strategy status */
 #define STRATEGY_SHMEM_SLOT 5
 
+// Number of available slots for anonymous memory mappings
+#define ANON_MAPPINGS 6
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 #endif							/* PG_SHMEM_H */
-- 
2.34.1

#33Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#1)
6 attachment(s)
Re: Changing shared_buffers without restart

On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal about how to
coordinate shared memory resizing between backends. The rest is more or less
the same; feedback about the coordination part is appreciated. It's a lot to
read, but the main differences are:

1. Allowing a GUC value change to be decoupled from actually applying it, a
sort of "pending" change. The idea is to let custom logic be triggered from an
assign hook, which then takes responsibility for what happens later and how the
change is going to be applied. This makes it possible to use the regular GUC
infrastructure in cases where a value change requires some complicated
processing. I was trying to keep the change not too invasive; it's also still
missing GUC reporting. A minimal sketch of this idea follows after the bullet
list below.

2. The shared memory resizing patch became more complicated thanks to the
coordination between backends. The current implementation was chosen from a few
more or less equal alternatives, which evolve along the following lines:

* There should be one "coordinator" process overseeing the change. Having the
postmaster fulfill this role, as in this patch, seems like a natural idea, but
it poses certain challenges since it doesn't have locking infrastructure.
Another option would be to elect a single backend as coordinator, which would
handle the postmaster as a special case. If there is ever a "coordinator"
worker in Postgres, it would be useful here.

* The coordinator uses EmitProcSignalBarrier to reach out to all other backends
and trigger the resize process. Backends join a Barrier to synchronize and wait
until everyone is finished (see the sketch after this list).

* There is some resizing state stored in shared memory, which is there to
handle backends that were for some reason late or didn't receive the signal.
What to store there is open for discussion.

* Since we want to make sure all processes share the same understanding of what
the NBuffers value is, almost any failure is a hard stop: rolling back the
change would need coordination as well, which sounds a bit too complicated for
now.
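
To illustrate point 1, here is a rough sketch of the pending-change idea, not
the actual implementation from the patches: the assign hook only records the
requested value, and the resize itself runs later from a safe point. The names
pending_nbuffers and ProcessPendingShmemResize are made up for this
illustration; AnonymousShmemResize() is the function from the resizing patch.

static int	pending_nbuffers = -1;	/* hypothetical "pending" value */

/* Assign hook: remember the request, do not touch NBuffers yet. */
static void
assign_shared_buffers(int newval, void *extra)
{
	pending_nbuffers = newval;
}

/* Called later from a safe point, e.g. between commands. */
static void
ProcessPendingShmemResize(void)
{
	if (pending_nbuffers < 0 || pending_nbuffers == NBuffers)
		return;

	/* Resize the mappings and dependent structures, publishing NBuffers. */
	AnonymousShmemResize(pending_nbuffers, NULL);
	pending_nbuffers = -1;
}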

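And for point 2, a compressed sketch of the coordination flow described in the
bullets above. PROCSIGNAL_BARRIER_SHMEM_RESIZE, WAIT_EVENT_SHMEM_RESIZE,
ShmemResizeState and ResizeCtl are invented names used only for illustration;
EmitProcSignalBarrier, WaitForProcSignalBarrier and the Barrier API are the
existing facilities mentioned in the proposal.

/* Hypothetical resize state kept in the main shmem segment. */
typedef struct ShmemResizeState
{
	int			target_nbuffers;	/* value every backend should adopt */
	Barrier		barrier;			/* participants arrive here when done */
} ShmemResizeState;

static ShmemResizeState *ResizeCtl;

/* Coordinator: publish the target and signal all running backends. */
static void
CoordinateShmemResize(int newval)
{
	uint64		generation;

	ResizeCtl->target_nbuffers = newval;
	generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);

	/* Returns once every backend has absorbed the barrier. */
	WaitForProcSignalBarrier(generation);
}

/* Per-backend handler, invoked from ProcessProcSignalBarrier(). */
static bool
ProcessBarrierShmemResize(void)
{
	BarrierAttach(&ResizeCtl->barrier);

	/* Remap this backend's view of the mappings to the new size. */
	AnonymousShmemResize(ResizeCtl->target_nbuffers, NULL);

	/* Wait until every participant has remapped, then continue. */
	BarrierArriveAndWait(&ResizeCtl->barrier, WAIT_EVENT_SHMEM_RESIZE);
	BarrierDetach(&ResizeCtl->barrier);

	return true;
}
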
We've tested this change manually for now, although it might be useful to try
out injection points. The testing strategy, which has caught plenty of bugs,
was simply to run a pgbench workload against a running instance and change
shared_buffers on the fly. Some more subtle cases were verified by manually
injecting delays to trigger expected scenarios.

To reiterate, here is patches breakdown:

Patches 1-3 prepare the infrastructure and shared memory layout. They could be
useful even with multithreaded PostgreSQL, when there will be no need for
shared memory. I assume, in the multithreaded world there still will be need
for a contiguous chunk of memory to share between threads, and its layout would
be similar to the one with shared memory mappings. Note that patch nr 2 is
going away as soon as I get to implementing shared memory address reservation,
but for now it's needed.

Patch 4 is a new addition to handle "pending" GUC changes.

Patch 5 actually does resizing. It's shared memory specific of course, and
utilizes the Linux specific mremap, meaning open portability questions.

Patch 6 is somewhat independent, but quite convenient to have. It also utilizes
Linux specific call memfd_create.

I would like to get some feedback on the synchronization part. While waiting
I'll proceed with implementing shared memory address space reservation, and Ashutosh
will continue with buffer eviction to support shared memory reduction.

Attachments:

v2-0001-Allow-to-use-multiple-shared-memory-mappings.patch (text/plain)
From d88185fb3b4a3a0e102a3af52f4fb5564468db15 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 19 Feb 2025 17:43:13 +0100
Subject: [PATCH v2 1/6] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways shared memory can be organized.

Introduce the possibility of allocating multiple shared memory mappings,
where a single mapping is associated with a specified shared memory
segment. There is only a fixed number of available segments; currently
only the main shared memory segment is allocated. A new shared memory API
is introduced, extended with a segment as a new parameter. As a path of
least resistance, the original API is kept in place, utilizing the main
shared memory segment.
---
 src/backend/port/posix_sema.c       |   4 +-
 src/backend/port/sysv_sema.c        |   4 +-
 src/backend/port/sysv_shmem.c       | 138 ++++++++++++++++++---------
 src/backend/port/win32_sema.c       |   2 +-
 src/backend/storage/ipc/ipc.c       |   4 +-
 src/backend/storage/ipc/ipci.c      |  63 +++++++------
 src/backend/storage/ipc/shmem.c     | 141 +++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c   |   5 +-
 src/include/storage/buf_internals.h |   1 +
 src/include/storage/ipc.h           |   2 +-
 src/include/storage/pg_sema.h       |   2 +-
 src/include/storage/pg_shmem.h      |  18 ++++
 src/include/storage/shmem.h         |  12 +++
 13 files changed, 272 insertions(+), 124 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..401e1113fa1 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 #endif
 
 	numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index f7c8638aec5..b6301463ac7 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -313,7 +313,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -334,7 +334,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 	numSharedSemas = 0;
 	maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..843b1b3220f 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+	int shmem_segment;
+	Size shmem_size; 			/* Size of the mapping */
+	void *shmem; 				/* Pointer to the start of the mapped memory */
+	void *seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping segments */
+static int next_free_segment = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_segment)
+{
+	switch (shmem_segment)
+	{
+		case MAIN_SHMEM_SEGMENT:
+			return "main";
+		default:
+			return "unknown";
+	}
+}
+
+static void
+DebugMappings()
+{
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		AnonymousMapping m = Mappings[i];
+		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+			 MappingName(i), m.shmem, m.shmem_size);
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-	Size		allocsize = *size;
+	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
 				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		{
+			DebugMappings();
+			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
+				 MappingName(mapping->shmem_segment), allocsize);
+		}
 	}
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = *size;
+		allocsize = mapping->shmem_size;
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
 	if (ptr == MAP_FAILED)
 	{
 		errno = mmap_errno;
+		DebugMappings();
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
-	*size = allocsize;
-	return ptr;
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		AnonymousMapping m = Mappings[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	mapping->shmem_size = size;
+	mapping->shmem_segment = next_free_segment;
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping);
+
+		next_free_segment++;
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + next_free_segment;
 
 	for (;;)
 	{
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	mapping->seg_addr = memAddress;
+	mapping->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (mapping->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		AnonymousMapping m = Mappings[i];
+
+		if (m.seg_addr != NULL)
+		{
+			if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+			m.seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index 5854ad1f54d..e7365ff8060 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index e4d5b944e12..9d526eb43fd 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -61,6 +61,8 @@ static void proc_exit_prepare(int code);
  * but provide some additional features we need --- in particular,
  * we want to register callbacks to invoke when we are disconnecting
  * from a broken shared-memory context but not exiting the postmaster.
+ * Maximum number of such exit callbacks depends on the number of shared
+ * segments.
  *
  * Callback functions can take zero, one, or two args: the first passed
  * arg is the integer exitcode, the second is the Datum supplied when
@@ -68,7 +70,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..4f6c707c204 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -85,7 +85,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_segment)
 {
 	Size		size;
 	int			numSemas;
@@ -204,33 +204,38 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize(&numSemas);
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
-
-	/*
-	 * Create semaphores
-	 */
-	PGReserveSemaphores(numSemas);
-
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		/* Compute the size of the shared-memory block */
+		size = CalculateShmemSize(&numSemas, segment);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+		/*
+		 * Create the shmem segment.
+		 *
+		 * XXX: Are multiple shims needed, one per segment?
+		 */
+		seghdr = PGSharedMemoryCreate(size, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSegment(seghdr, segment);
+
+		/*
+		 * Create semaphores
+		 */
+		PGReserveSemaphores(numSemas, segment);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSegment(segment);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -360,7 +365,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas);
+	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..389abc82519 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -75,19 +75,19 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
+								 int shmem_segment);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
+ShmemSegment Segments[ANON_MAPPINGS];
 
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/*
+ * Primary index hashtable for shmem. For simplicity we use a single one for
+ * all shared memory segments. There can be performance consequences of that;
+ * an alternative option would be to have one index per shared memory segment.
+ */
+static HTAB *ShmemIndex = NULL;
 
 
 /*
@@ -96,9 +96,17 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-	ShmemSegHdr = seghdr;
-	ShmemBase = seghdr;
-	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+	InitShmemAccessInSegment(seghdr, MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAccessInSegment(PGShmemHeader *seghdr, int shmem_segment)
+{
+	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	ShmemSegment *seg = &Segments[shmem_segment];
+	seg->ShmemSegHdr = shmhdr;
+	seg->ShmemBase = (void *) shmhdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -109,7 +117,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSegment(MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAllocationInSegment(int shmem_segment)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_segment].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -118,9 +132,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_segment].ShmemLock = (slock_t *) ShmemAllocUnlockedInSegment(sizeof(slock_t), shmem_segment);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_segment].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -145,11 +159,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocInSegment(Size size, int shmem_segment)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -179,6 +199,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSegment(size, allocated_size, MAIN_SHMEM_SEGMENT);
+}
+
+static void *
+ShmemAllocRawInSegment(Size size, Size *allocated_size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -198,22 +224,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= Segments[shmem_segment].ShmemSegHdr->totalsize)
 	{
-		newSpace = (char *) ShmemBase + newStart;
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
+		Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -231,6 +257,12 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+	return ShmemAllocUnlockedInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -241,19 +273,19 @@ ShmemAllocUnlocked(Size size)
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory (%zu bytes requested)",
 						size)));
-	ShmemSegHdr->freeoffset = newFree;
+	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (char *) ShmemBase + newStart;
+	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -268,7 +300,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSegment(addr, MAIN_SHMEM_SEGMENT);
+}
+
+bool
+ShmemAddrIsValidInSegment(const void *addr, int shmem_segment)
+{
+	return (addr >= Segments[shmem_segment].ShmemBase) && (addr < Segments[shmem_segment].ShmemEnd);
 }
 
 /*
@@ -329,6 +367,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSegment(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SEGMENT);
+}
+
+HTAB *
+ShmemInitHashInSegment(const char *name,		/* table string name for shmem index */
+			  long init_size,		/* initial table size */
+			  long max_size,		/* max size of the table */
+			  HASHCTL *infoP,		/* info about key and bucket size */
+			  int hash_flags,		/* info about infoP */
+			  int shmem_segment) 	/* in which segment to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -345,9 +395,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSegment(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_segment);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -380,6 +430,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSegment(name, size, foundPtr, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
+					  int shmem_segment)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -388,7 +445,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_segment].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -411,7 +468,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSegment(size, shmem_segment);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -428,8 +485,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in segment %d",
+						name, shmem_segment)));
 	}
 
 	if (*foundPtr)
@@ -454,7 +511,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -473,14 +530,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -537,10 +593,11 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
+	/* XXX: take all shared memory segments into account. */
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
 		values[2] = Int64GetDatum(ent->size);
 		values[3] = Int64GetDatum(ent->allocated_size);
 		named_allocated += ent->allocated_size;
@@ -552,15 +609,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	/* output shared memory allocated but not counted via the shmem index */
 	values[0] = CStringGetTextDatum("<anonymous>");
 	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 	/* output as-of-yet unused shared memory */
 	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index f1e74f184f1..40aa4014b5f 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -81,6 +81,7 @@
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/spin.h"
@@ -607,9 +608,9 @@ LWLockNewTrancheId(void)
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
 	/* We use the ShmemLock spinlock to protect LWLockCounter */
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 	result = (*LWLockCounter)++;
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	return result;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 1a65342177d..4595f5a9676 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -22,6 +22,7 @@
 #include "storage/condition_variable.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
+#include "storage/pg_shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index e0f5f92e947..c0439f2206b 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index fa6ca35a51f..8ae9637fcd0 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_segment);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..138078c29c5 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+} ShmemSegment;
+
+/* Number of available segments for anonymous memory mappings */
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -90,4 +105,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main segment, contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SEGMENT 0
+
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 904a336b851..5929f140236 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -29,15 +29,27 @@
 extern PGDLLIMPORT slock_t *ShmemLock;
 struct PGShmemHeader;			/* avoid including storage/pg_shmem.h here */
 extern void InitShmemAccess(struct PGShmemHeader *seghdr);
+extern void InitShmemAccessInSegment(struct PGShmemHeader *seghdr,
+									 int shmem_segment);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSegment(int shmem_segment);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSegment(Size size, int shmem_segment);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSegment(Size size, int shmem_segment);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSegment(const void *addr, int shmem_segment);
 extern void InitShmemIndex(void);
+extern void InitVariableShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
+									long max_size, HASHCTL *infoP,
+									int hash_flags, int shmem_segment);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSegment(const char *name, Size size,
+									  bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 

base-commit: 80d7f990496b1c7be61d9a00a2635b7d96b96197
-- 
2.45.1

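To illustrate the idea behind patch 1 outside of PostgreSQL, here is a minimal
standalone sketch (not part of the patch; the names are made up) that creates
several anonymous shared mappings and tracks them in a small descriptor array,
similar to what the AnonymousMapping array does above:

    /* Standalone sketch, not PostgreSQL code. */
    #include <stdio.h>
    #include <sys/mman.h>

    typedef struct Mapping
    {
        size_t size;                /* size of the mapping */
        void  *addr;                /* start of the mapped memory */
    } Mapping;

    #define NUM_MAPPINGS 2

    int
    main(void)
    {
        Mapping maps[NUM_MAPPINGS] = {{16 * 4096, NULL}, {64 * 4096, NULL}};

        /* One anonymous shared mapping per "segment". */
        for (int i = 0; i < NUM_MAPPINGS; i++)
        {
            maps[i].addr = mmap(NULL, maps[i].size, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (maps[i].addr == MAP_FAILED)
            {
                perror("mmap");
                return 1;
            }
            printf("mapping[%d]: addr %p, size %zu\n",
                   i, maps[i].addr, maps[i].size);
        }

        /* Tear everything down, as the on-exit callback would. */
        for (int i = 0; i < NUM_MAPPINGS; i++)
            munmap(maps[i].addr, maps[i].size);

        return 0;
    }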
v2-0002-Allow-placing-shared-memory-mapping-with-an-offse.patch
From 7543fcdfc8ca1a0e1c85f397eb6dddfe1426b379 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:21:33 +0200
Subject: [PATCH v2 2/6] Allow placing shared memory mapping with an offset

Currently the kernel is responsible for choosing an address at which to place
each shared memory mapping, which is the lowest possible address that does not
clash with any other mapping. This is considered the most portable approach,
but one of the downsides is that there is no room left to resize the allocated
mappings. Here is how it looks for one mapping in /proc/$PID/maps, where
/dev/zero (deleted) represents the anonymous shared memory in question:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
    ...
    7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
    7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)

By specifying the mapping address directly it's possible to place the
mapping in a way that leaves room for resizing. The idea is first to get
the address chosen by the kernel, then apply some offset derived from
the expected upper limit. Because we base the layout on the address
chosen by the kernel, things like address space randomization should not
be a problem, since the randomization is applied to the mmap base, which
is one per process. The result looks like this:

    012d9000-0133e000         [heap]
    7f443a800000-7f444196c000 /dev/zero (deleted)
    [...free space...]
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2

This approach does not impact the actual memory usage as reported by the kernel.
Here is the output of /proc/$PID/status for the master version with
shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages mapped in mm_struct
    VmPeak:           422780 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             21248 kB
    // Size of resident anonymous memory
    RssAnon:             640 kB
    // Size of resident file mappings
    RssFile:            9728 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          10880 kB

Here is the same for the patch with the shared mapping placed at
an offset 10 GB:

    VmPeak:          1102844 kB
    VmRSS:             21376 kB
    RssAnon:             640 kB
    RssFile:            9856 kB
    RssShmem:          10880 kB

Cgroup v2 does not have any problems with that either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    18219008 (~17 MB)

Note that currently the implementation makes assumptions about the upper limit.
Ideally it should be based on the maximum available memory.
---
 src/backend/port/sysv_shmem.c | 120 +++++++++++++++++++++++++++++++++-
 1 file changed, 119 insertions(+), 1 deletion(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 843b1b3220f..62f01d8218a 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -108,6 +108,63 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
+/*
+ * Anonymous mapping placing (/dev/zero (deleted) below) looks like this:
+ *
+ * 00400000-00490000         /path/bin/postgres
+ * ...
+ * 012d9000-0133e000         [heap]
+ * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ * 7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
+ * 7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)
+ * ...
+ *
+ * We would like to place multiple mappings in such a way, that there will be
+ * enough space between them in the address space to be able to resize up to
+ * certain size, but without counting towards the total memory consumption.
+ *
+ * By letting Linux choose a mapping address, it will pick the lowest
+ * possible address that does not clash with any other mappings, which will be
+ * right before the locale archive in the example above. This information (the
+ * maximum allowed size of mappings and the lowest mapping address) is enough
+ * to place every mapping as follows:
+ *
+ * - Take the lowest mapping address, which we later call the probe address.
+ * - Subtract the offset of the previous mapping.
+ * - Subtract the maximum allowed size for the current mapping from the
+ *   address.
+ * - Place the mapping at the resulting address.
+ *
+ * The result would look like this:
+ *
+ * 012d9000-0133e000         [heap]
+ * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * [...free space...]
+ * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * [...free space...]
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ */
+Size SHMEM_EXTRA_SIZE_LIMIT[1] = {
+	0, 									/* MAIN_SHMEM_SLOT */
+};
+
+/* Remembers offset of the last mapping from the probe address */
+static Size last_offset = 0;
+
+/*
+ * Size of the mapping, which will be used to calculate anonymous mapping
+ * address. It should not be too small, otherwise there is a chance the probe
+ * mapping will be created between other mappings, leaving no room for extending
+ * it. But it should not be too large either, in case there are limitations on
+ * the mapping size. The current value is the default shared_buffers.
+ */
+#define PROBE_MAPPING_SIZE (Size) 128 * 1024 * 1024
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -673,13 +730,74 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
 	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
 	{
+		void *probe = NULL;
+
 		/*
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
 		allocsize = mapping->shmem_size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+
+		/*
+		 * Try to create mapping at an address, which will allow to extend it
+		 * later:
+		 *
+		 * - First create a temporary probe mapping of a fixed size and let
+		 *   the kernel place it at an address of its choice. By virtue of the
+		 *   probe mapping size we expect it to be located at the lowest
+		 *   possible address, with some unmapped space above it.
+		 *
+		 * - Unmap the probe mapping, remember the address.
+		 *
+		 * - Create an actual anonymous mapping at that address with the
+		 *   offset. The offset is calculated in such a way as to allow growing
+		 *   the mapping within certain boundaries. For this mapping we use
+		 *   MAP_FIXED_NOREPLACE, which will error out with EEXIST if there is
+		 *   any mapping clash.
+		 *
+		 * - If the last step has failed, fallback to the regular mapping
+		 *   creation and signal that shared buffers could not be resized
+		 *   without a restart.
+		 */
+		probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
+
+		if (probe == MAP_FAILED)
+		{
+			mmap_errno = errno;
+			DebugMappings();
+			elog(DEBUG1, "segment[%s]: probe mmap(%zu) failed: %m",
+					MappingName(mapping->shmem_segment), allocsize);
+		}
+		else
+		{
+			Size offset = last_offset + SHMEM_EXTRA_SIZE_LIMIT[next_free_segment] + allocsize;
+			last_offset = offset;
+
+			munmap(probe, PROBE_MAPPING_SIZE);
+
+			ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE,
+					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+			mmap_errno = errno;
+			if (ptr == MAP_FAILED)
+			{
+				DebugMappings();
+				elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p failed: %m",
+					 MappingName(mapping->shmem_segment), allocsize, probe - offset);
+			}
+
+		}
+	}
+
+	if (ptr == MAP_FAILED)
+	{
+		/*
+		 * Fallback to the portable way of creating a mapping.
+		 */
+		allocsize = mapping->shmem_size;
+
+		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+						   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
 	}
 
-- 
2.45.1

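The placement trick described in the commit message can be reproduced in
isolation. Below is a minimal standalone sketch (not part of the patch; the
sizes and error handling are simplified assumptions) that creates a probe
mapping, unmaps it, and re-creates the real mapping lower in the address space
with MAP_FIXED_NOREPLACE, leaving unmapped room above it to grow into. It needs
_GNU_SOURCE and a reasonably recent Linux/glibc for MAP_FIXED_NOREPLACE:

    /* Standalone illustration of the probe-and-place trick; not PostgreSQL code. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define PROBE_SIZE   ((size_t) 128 * 1024 * 1024)   /* probe mapping size */
    #define ALLOC_SIZE   ((size_t) 128 * 1024 * 1024)   /* actual segment size */
    #define EXTRA_SIZE   ((size_t) 1024 * 1024 * 1024)  /* reserved room to grow */

    int
    main(void)
    {
        /* Let the kernel pick the lowest free address for a throwaway probe. */
        char *probe = mmap(NULL, PROBE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        char *ptr;

        if (probe == MAP_FAILED)
        {
            perror("probe mmap");
            return 1;
        }
        munmap(probe, PROBE_SIZE);

        /*
         * Place the real mapping below the probe address, leaving EXTRA_SIZE of
         * unmapped address space above it for a future resize.
         * MAP_FIXED_NOREPLACE fails with EEXIST instead of clobbering anything.
         */
        ptr = mmap(probe - (ALLOC_SIZE + EXTRA_SIZE), ALLOC_SIZE,
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
        if (ptr == MAP_FAILED)
        {
            perror("placed mmap");
            return 1;
        }

        printf("probe was at %p, segment placed at %p\n",
               (void *) probe, (void *) ptr);
        munmap(ptr, ALLOC_SIZE);
        return 0;
    }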
v2-0003-Introduce-multiple-shmem-segments-for-shared-buff.patch
From d7af86299878acb73019ac699ef57c120199f1ee Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Mon, 24 Feb 2025 20:08:28 +0100
Subject: [PATCH v2 3/6] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers,
meaning that if we want to change NBuffers, these structures have to be
resized correspondingly. Placing each of them in a separate shmem
segment makes that possible.

There are some assumptions made about each shmem segment's upper size
limit. The buffer blocks get the largest reserve, while the rest claim
less extra room for resizing. Ideally those limits should be derived
from the maximum allowed shared memory.
---
 src/backend/port/sysv_shmem.c          | 19 ++++++-
 src/backend/storage/buffer/buf_init.c  | 79 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  5 +-
 src/backend/storage/buffer/freelist.c  |  4 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 24 +++++++-
 7 files changed, 99 insertions(+), 36 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 62f01d8218a..59aa67cb135 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -149,8 +149,13 @@ static int next_free_segment = 0;
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
  */
-Size SHMEM_EXTRA_SIZE_LIMIT[1] = {
-	0, 									/* MAIN_SHMEM_SLOT */
+Size SHMEM_EXTRA_SIZE_LIMIT[6] = {
+	0, 									/* MAIN_SHMEM_SEGMENT */
+	(Size) 1024 * 1024 * 1024 * 10, 	/* BUFFERS_SHMEM_SEGMENT */
+	(Size) 1024 * 1024 * 1024 * 1, 		/* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
+	(Size) 1024 * 1024 * 100, 			/* BUFFER_IOCV_SHMEM_SEGMENT */
+	(Size) 1024 * 1024 * 100, 			/* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
+	(Size) 1024 * 1024 * 100, 			/* STRATEGY_SHMEM_SEGMENT */
 };
 
 /* Remembers offset of the last mapping from the probe address */
@@ -179,6 +184,16 @@ MappingName(int shmem_segment)
 	{
 		case MAIN_SHMEM_SEGMENT:
 			return "main";
+		case BUFFERS_SHMEM_SEGMENT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SEGMENT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SEGMENT:
+			return "strategy";
 		default:
 			return "unknown";
 	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1f8e03190..f5b9290a640 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -61,7 +61,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). The size of the data structures
+ * initialized here depends on NBuffers, and to be able to change NBuffers
+ * without a restart we store each structure in a separate shared memory
+ * segment, which can be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -73,22 +76,22 @@ BufferManagerShmemInit(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSegment("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSegment("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -98,8 +101,9 @@ BufferManagerShmemInit(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -153,33 +157,54 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc. based on the
+ * shared memory segment. The main segment must not allocate anything
+ * related to buffers; every other segment will receive part of the
+ * data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(int shmem_segment)
 {
 	Size		size = 0;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == MAIN_SHMEM_SEGMENT)
+		return size;
 
-	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
+	{
+		/* size of buffer descriptors */
+		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+		/* to allow aligning buffer descriptors */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of data pages, plus alignment padding */
+		size = add_size(size, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	}
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
-								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
+	{
+		/* size of stuff controlled by freelist.c */
+		size = add_size(size, StrategyShmemSize());
+	}
 
-	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
+	{
+		/* size of I/O condition variables */
+		size = add_size(size, mul_size(NBuffers,
+									   sizeof(ConditionVariableMinimallyPadded)));
+		/* to allow aligning the above */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
+
+	if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of checkpoint sort array in bufmgr.c */
+		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	}
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index a50955d5286..ac449954dab 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -59,10 +59,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION,
+								  STRATEGY_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 336715b6c63..4919a92f2be 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -491,9 +491,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SEGMENT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 4f6c707c204..68778522591 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -112,7 +112,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize());
+	size = add_size(size, BufferManagerShmemSize(shmem_segment));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7c1e4316dde..bb7fe02e243 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -297,7 +297,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 138078c29c5..ba0192baf95 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -105,7 +105,29 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/*
+ * To be able to dynamically resize the largest parts of the data stored in
+ * shared memory, we split it into multiple shared memory segments. Each
+ * segment contains only a certain part of the data, whose size depends on
+ * NBuffers.
+ */
+
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SEGMENT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SEGMENT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SEGMENT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SEGMENT 5
+
 #endif							/* PG_SHMEM_H */
-- 
2.45.1

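To put the reserve sizes above into perspective, a rough back-of-the-envelope
estimate (my numbers, assuming BLCKSZ = 8192, sizeof(BufferDescPadded) = 64 and
roughly 16 bytes per padded condition variable on a 64-bit build): the 10 GB
reserved above the buffer blocks corresponds to about 1.3 million additional
buffers, which would need on the order of 1.3M * 64 B ~ 80 MB of additional
descriptors and 1.3M * 16 B ~ 20 MB of additional condition variables, so the
1 GB and 100 MB reserves for the other segments should leave comfortable
headroom for that amount of growth.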
v2-0004-Introduce-pending-flag-for-GUC-assign-hooks.patch
From 0173967e8b0fd6c23b158c34b92651fc37ab7660 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 19 Feb 2025 17:45:40 +0100
Subject: [PATCH v2 4/6] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the fact that the new value will be applied
immediately after the hook. Certain GUC options (like shared_buffers,
coming in subsequent patches) may need coordinated work between backends
to change, meaning we cannot apply the new value right away.

Add a new "pending" flag to allow an assign hook to indicate exactly
that. If the pending flag is set after the hook, the new value will not
be applied, and its handling becomes the responsibility of the hook's
implementation.

Note that this also requires changes in the way GUCs are reported, but
the patch does not cover that yet.
---
 src/backend/access/transam/xlog.c    |  2 +-
 src/backend/commands/variable.c      |  2 +-
 src/backend/libpq/pqcomm.c           |  8 ++--
 src/backend/tcop/postgres.c          |  2 +-
 src/backend/utils/misc/guc.c         | 59 +++++++++++++++++++---------
 src/backend/utils/misc/stack_depth.c |  2 +-
 src/include/utils/guc.h              |  2 +-
 src/include/utils/guc_hooks.h        | 16 ++++----
 8 files changed, 57 insertions(+), 36 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9bf5ba7509..ff82ba0a53d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2188,7 +2188,7 @@ CalculateCheckpointSegments(void)
 }
 
 void
-assign_max_wal_size(int newval, void *extra)
+assign_max_wal_size(int newval, void *extra, bool *pending)
 {
 	max_wal_size_mb = newval;
 	CalculateCheckpointSegments();
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 4ad6e236d69..f24c2a0d252 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1143,7 +1143,7 @@ check_cluster_name(char **newval, void **extra, GucSource source)
  * GUC assign_hook for maintenance_io_concurrency
  */
 void
-assign_maintenance_io_concurrency(int newval, void *extra)
+assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
 {
 #ifdef USE_PREFETCH
 	/*
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 61ea3722ae2..cdf21847d7e 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1949,7 +1949,7 @@ pq_settcpusertimeout(int timeout, Port *port)
  * GUC assign_hook for tcp_keepalives_idle
  */
 void
-assign_tcp_keepalives_idle(int newval, void *extra)
+assign_tcp_keepalives_idle(int newval, void *extra, bool *pending)
 {
 	/*
 	 * The kernel API provides no way to test a value without setting it; and
@@ -1982,7 +1982,7 @@ show_tcp_keepalives_idle(void)
  * GUC assign_hook for tcp_keepalives_interval
  */
 void
-assign_tcp_keepalives_interval(int newval, void *extra)
+assign_tcp_keepalives_interval(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivesinterval(newval, MyProcPort);
@@ -2005,7 +2005,7 @@ show_tcp_keepalives_interval(void)
  * GUC assign_hook for tcp_keepalives_count
  */
 void
-assign_tcp_keepalives_count(int newval, void *extra)
+assign_tcp_keepalives_count(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivescount(newval, MyProcPort);
@@ -2028,7 +2028,7 @@ show_tcp_keepalives_count(void)
  * GUC assign_hook for tcp_user_timeout
  */
 void
-assign_tcp_user_timeout(int newval, void *extra)
+assign_tcp_user_timeout(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_settcpusertimeout(newval, MyProcPort);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 1149d89d7a1..13fb8c31702 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3555,7 +3555,7 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 
 /* GUC assign hook for transaction_timeout */
 void
-assign_transaction_timeout(int newval, void *extra)
+assign_transaction_timeout(int newval, void *extra, bool *pending)
 {
 	if (IsTransactionState())
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 12192445218..bab1c5d08f6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1679,6 +1679,7 @@ InitializeOneGUCOption(struct config_generic *gconf)
 				struct config_int *conf = (struct config_int *) gconf;
 				int			newval = conf->boot_val;
 				void	   *extra = NULL;
+				bool 	   pending = false;
 
 				Assert(newval >= conf->min);
 				Assert(newval <= conf->max);
@@ -1687,9 +1688,13 @@ InitializeOneGUCOption(struct config_generic *gconf)
 					elog(FATAL, "failed to initialize %s to %d",
 						 conf->gen.name, newval);
 				if (conf->assign_hook)
-					conf->assign_hook(newval, extra);
-				*conf->variable = conf->reset_val = newval;
-				conf->gen.extra = conf->reset_extra = extra;
+					conf->assign_hook(newval, extra, &pending);
+
+				if (!pending)
+				{
+					*conf->variable = conf->reset_val = newval;
+					conf->gen.extra = conf->reset_extra = extra;
+				}
 				break;
 			}
 		case PGC_REAL:
@@ -2041,13 +2046,18 @@ ResetAllOptions(void)
 			case PGC_INT:
 				{
 					struct config_int *conf = (struct config_int *) gconf;
+					bool 			  pending = false;
 
 					if (conf->assign_hook)
 						conf->assign_hook(conf->reset_val,
-										  conf->reset_extra);
-					*conf->variable = conf->reset_val;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									conf->reset_extra);
+										  conf->reset_extra,
+										  &pending);
+					if (!pending)
+					{
+						*conf->variable = conf->reset_val;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										conf->reset_extra);
+					}
 					break;
 				}
 			case PGC_REAL:
@@ -2424,16 +2434,21 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
 							struct config_int *conf = (struct config_int *) gconf;
 							int			newval = newvalue.val.intval;
 							void	   *newextra = newvalue.extra;
+							bool 	    pending = false;
 
 							if (*conf->variable != newval ||
 								conf->gen.extra != newextra)
 							{
 								if (conf->assign_hook)
-									conf->assign_hook(newval, newextra);
-								*conf->variable = newval;
-								set_extra_field(&conf->gen, &conf->gen.extra,
-												newextra);
-								changed = true;
+									conf->assign_hook(newval, newextra, &pending);
+
+								if (!pending)
+								{
+									*conf->variable = newval;
+									set_extra_field(&conf->gen, &conf->gen.extra,
+													newextra);
+									changed = true;
+								}
 							}
 							break;
 						}
@@ -3850,18 +3865,24 @@ set_config_with_handle(const char *name, config_handle *handle,
 
 				if (changeVal)
 				{
+					bool pending = false;
+
 					/* Save old value to support transaction abort */
 					if (!makeDefault)
 						push_old_value(&conf->gen, action);
 
 					if (conf->assign_hook)
-						conf->assign_hook(newval, newextra);
-					*conf->variable = newval;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									newextra);
-					set_guc_source(&conf->gen, source);
-					conf->gen.scontext = context;
-					conf->gen.srole = srole;
+						conf->assign_hook(newval, newextra, &pending);
+
+					if (!pending)
+					{
+						*conf->variable = newval;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										newextra);
+						set_guc_source(&conf->gen, source);
+						conf->gen.scontext = context;
+						conf->gen.srole = srole;
+					}
 				}
 				if (makeDefault)
 				{
diff --git a/src/backend/utils/misc/stack_depth.c b/src/backend/utils/misc/stack_depth.c
index 8f7cf531fbc..ef59ae62008 100644
--- a/src/backend/utils/misc/stack_depth.c
+++ b/src/backend/utils/misc/stack_depth.c
@@ -156,7 +156,7 @@ check_max_stack_depth(int *newval, void **extra, GucSource source)
 
 /* GUC assign hook for max_stack_depth */
 void
-assign_max_stack_depth(int newval, void *extra)
+assign_max_stack_depth(int newval, void *extra, bool *pending)
 {
 	ssize_t		newval_bytes = newval * (ssize_t) 1024;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 1233e07d7da..ce9f258100d 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -187,7 +187,7 @@ typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource sourc
 typedef bool (*GucEnumCheckHook) (int *newval, void **extra, GucSource source);
 
 typedef void (*GucBoolAssignHook) (bool newval, void *extra);
-typedef void (*GucIntAssignHook) (int newval, void *extra);
+typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);
 typedef void (*GucRealAssignHook) (double newval, void *extra);
 typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 typedef void (*GucEnumAssignHook) (int newval, void *extra);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 951451a9765..3e380f29e5a 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -81,12 +81,12 @@ extern void assign_log_timezone(const char *newval, void *extra);
 extern const char *show_log_timezone(void);
 extern bool check_maintenance_io_concurrency(int *newval, void **extra,
 											 GucSource source);
-extern void assign_maintenance_io_concurrency(int newval, void *extra);
+extern void assign_maintenance_io_concurrency(int newval, void *extra, bool *pending);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
 										 GucSource source);
-extern void assign_max_wal_size(int newval, void *extra);
+extern void assign_max_wal_size(int newval, void *extra, bool *pending);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
-extern void assign_max_stack_depth(int newval, void *extra);
+extern void assign_max_stack_depth(int newval, void *extra, bool *pending);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
@@ -141,13 +141,13 @@ extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 extern void assign_syslog_facility(int newval, void *extra);
 extern void assign_syslog_ident(const char *newval, void *extra);
-extern void assign_tcp_keepalives_count(int newval, void *extra);
+extern void assign_tcp_keepalives_count(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_count(void);
-extern void assign_tcp_keepalives_idle(int newval, void *extra);
+extern void assign_tcp_keepalives_idle(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_idle(void);
-extern void assign_tcp_keepalives_interval(int newval, void *extra);
+extern void assign_tcp_keepalives_interval(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_interval(void);
-extern void assign_tcp_user_timeout(int newval, void *extra);
+extern void assign_tcp_user_timeout(int newval, void *extra, bool *pending);
 extern const char *show_tcp_user_timeout(void);
 extern bool check_temp_buffers(int *newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra,
@@ -163,7 +163,7 @@ extern bool check_transaction_buffers(int *newval, void **extra, GucSource sourc
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
 extern bool check_transaction_read_only(bool *newval, void **extra, GucSource source);
-extern void assign_transaction_timeout(int newval, void *extra);
+extern void assign_transaction_timeout(int newval, void *extra, bool *pending);
 extern const char *show_unix_socket_permissions(void);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern bool check_wal_consistency_checking(char **newval, void **extra,
-- 
2.45.1

v2-0005-Allow-to-resize-shared-memory-without-restart.patch
From 78ea0efde8799445b90a70ca321e40b75fea52c9 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Thu, 20 Feb 2025 21:12:26 +0100
Subject: [PATCH v2 5/6] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using the
space introduced in the previous commits, without requiring a PostgreSQL
restart. Essentially the implementation is based on two mechanisms: a
global Barrier to coordinate backends that simultaneously change
shared_buffers, and state in shared memory to coordinate backends that
are too late to the party for some reason.

The resize process looks like this:

* The GUC assign hook sets a flag to let the Postmaster know that a
  resize was requested.

* Postmaster checks the flag in its event loop and starts the resize by
  emitting a ProcSignal barrier. Afterwards it resizes its own shared
  memory mappings as well.

* All backends that participate in the ProcSignal mechanism recalculate
  the shared memory size based on the new NBuffers and extend it using
  mremap.

* When finished, a backend waits on a global ShmemControl barrier until
  all other backends have finished as well. This way we ensure three
  stages with clear boundaries: before the resize, when all processes
  use the old NBuffers; during the resize, when processes have a mix of
  old and new NBuffers and wait until it's done; and after the resize,
  when all processes use the new NBuffers.

* After all backends are using the new value, one backend initializes
  the new shared structures (buffer blocks, descriptors, etc.) as needed
  and broadcasts the new value of NBuffers via ShmemControl in shared
  memory. Other backends wait for this operation to finish as well. Then
  the barrier is lifted and everything proceeds as usual.

Here is how it looks after raising shared_buffers from 128 MB to
512 MB and calling pg_reload_conf():

    -- 128 MB
    7f5a2bd04000-7f5a32e52000  /dev/zero (deleted)
    7f5a39252000-7f5a4030e000  /dev/zero (deleted)
    7f5a4670e000-7f5a4d7ba000  /dev/zero (deleted)
    7f5a53bba000-7f5a5ad26000  /dev/zero (deleted)
    7f5a9ad26000-7f5aa9d94000  /dev/zero (deleted)
    ^ buffers mapping, ~240 MB
    7f5d29d94000-7f5d30e00000  /dev/zero (deleted)

    -- 512 MB
    7f5a2bd04000-7f5a33274000  /dev/zero (deleted)
    7f5a39252000-7f5a4057e000  /dev/zero (deleted)
    7f5a4670e000-7f5a4d9fa000  /dev/zero (deleted)
    7f5a53bba000-7f5a5b1a6000  /dev/zero (deleted)
    7f5a9ad26000-7f5ac1f14000  /dev/zero (deleted)
    ^ buffers mapping, ~625 MB
    7f5d29d94000-7f5d30f80000  /dev/zero (deleted)

The implementation supports only increasing shared_buffers. Decreasing
the value requires a similar procedure, but the buffer blocks containing
data have to be drained first so that the actual data set fits into the
new, smaller space.

Experiments show that shared mappings have to be extended separately in
each process that uses them. Another rough edge is that a backend
blocked on ReadCommand will not apply the shared_buffers change until it
reads something.

Note that mremap is Linux specific, so the implementation is not very
portable.
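
For illustration, here is a minimal standalone sketch (not part of the
patch) of the mremap-based growth the patch relies on: probe for a large
contiguous address range, map only the initial size at its bottom, and
later extend the mapping in place into the unmapped space above it. It
is Linux-only and assumes a reasonably recent kernel/glibc for
MAP_FIXED_NOREPLACE; the sizes and names are made up for the example.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define RESERVE   (1024UL * 1024 * 1024)   /* address space to set aside */
    #define OLD_SIZE  (128UL * 1024 * 1024)
    #define NEW_SIZE  (512UL * 1024 * 1024)

    int main(void)
    {
        /* Probe for a contiguous address range, then release it again. */
        char *probe = mmap(NULL, RESERVE, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (probe == MAP_FAILED) { perror("probe mmap"); return 1; }
        munmap(probe, RESERVE);

        /* Map only the initial size at the bottom of the probed range,
         * leaving the rest unmapped so the mapping can grow into it. */
        char *shmem = mmap(probe, OLD_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                           -1, 0);
        if (shmem == MAP_FAILED) { perror("mmap"); return 1; }

        /* Later: extend the mapping in place. Flags are 0, i.e. no
         * MREMAP_MAYMOVE, so the start address stays stable. */
        if (mremap(shmem, OLD_SIZE, NEW_SIZE, 0) == MAP_FAILED)
        { perror("mremap"); return 1; }

        printf("grew mapping at %p from %lu to %lu bytes\n",
               (void *) shmem, OLD_SIZE, NEW_SIZE);
        return 0;
    }

In the actual patch the gap above each mapping is baked into the shared
memory layout at startup, so every backend can perform the same in-place
mremap and all processes keep identical addresses.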

Authors: Dmitrii Dolgov, Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 300 ++++++++++++++++++
 src/backend/postmaster/postmaster.c           |  15 +
 src/backend/storage/buffer/buf_init.c         | 152 ++++++++-
 src/backend/storage/ipc/ipci.c                |  11 +
 src/backend/storage/ipc/procsignal.c          |  45 +++
 src/backend/storage/ipc/shmem.c               |  14 +-
 src/backend/tcop/postgres.c                   |  15 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/backend/utils/misc/guc_tables.c           |   4 +-
 src/include/storage/bufmgr.h                  |   1 +
 src/include/storage/ipc.h                     |   2 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pg_shmem.h                |  24 ++
 src/include/storage/procsignal.h              |   1 +
 src/tools/pgindent/typedefs.list              |   1 +
 15 files changed, 577 insertions(+), 12 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 59aa67cb135..35a8ff92175 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,13 +30,17 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/pidfile.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -105,6 +109,13 @@ typedef struct AnonymousMapping
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/* Flag telling postmaster that resize is needed */
+volatile bool pending_pm_shmem_resize = false;
+
+/* Keeps track of the previous NBuffers value */
+static int NBuffersOld = -1;
+static int NBuffersPending = -1;
+
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
@@ -859,6 +870,274 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * Resize all shared memory segments based on the current NBuffers value, which
+ * is applied from NBuffersPending. The actual segment resizing is done via
+ * mremap, which will fail if there is not sufficient space to expand the
+ * mapping. When finished, initialize any new buffer blocks based on the old
+ * and new values.
+ *
+ * If reinitialization took place, as the last step this function broadcasts
+ * NSharedBuffers with its new value, allowing any other backends to rely on
+ * this new value and skip buffer reinitialization.
+ */
+static bool
+AnonymousShmemResize(void)
+{
+	int	numSemas;
+	bool reinit = false;
+	NBuffers = NBuffersPending;
+
+	elog(DEBUG1, "Resize shmem from %d to %d", NBuffersOld, NBuffers);
+
+	/*
+	 * XXX: Where to reset the flag is still an open question. E.g. do we
+	 * consider a no-op when NBuffers is equal to NBuffersOld a genuine resize
+	 * and reset the flag?
+	 */
+	pending_pm_shmem_resize = false;
+
+	/*
+	 * XXX: Currently only increasing of shared_buffers is supported. For
+	 * decreasing something similar has to be done, but buffer blocks with
+	 * data have to be drained first.
+	 */
+	if(NBuffersOld > NBuffers)
+		return false;
+
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		/* Note that CalculateShmemSize indirectly depends on NBuffers */
+		Size new_size = CalculateShmemSize(&numSemas, i);
+		AnonymousMapping *m = &Mappings[i];
+
+		if (m->shmem == NULL)
+			continue;
+
+		if (m->shmem_size == new_size)
+			continue;
+
+
+		/*
+		 * Fail hard if faced any issues. In theory we could try to handle this
+		 * more gracefully and proceed with shared memory as before, but some
+		 * other backends might have succeeded and have different size. If we
+		 * would like to go this way, to be consistent we would need to
+		 * synchronize again, and it's not clear if it's worth the effort.
+		 */
+		if (mremap(m->shmem, m->shmem_size, new_size, 0) < 0)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not resize shared memory %p to %d (%zu): %m",
+							m->shmem, NBuffers, m->shmem_size)));
+		else
+		{
+			reinit = true;
+			m->shmem_size = new_size;
+		}
+	}
+
+	if (reinit)
+	{
+		if(IsUnderPostmaster &&
+			LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+		{
+			/*
+			 * If the new NBuffers was already broadcasted, the buffer pool was
+			 * already initialized before.
+			 *
+			 * Since we're not on a hot path, we use lwlocks and do not need to
+			 * involve memory barrier.
+			 */
+			if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+			{
+				/*
+				 * Allow the first backend that managed to get the lock to
+				 * reinitialize the new portion of buffer pool. Every other
+				 * process will wait on the shared barrier for that to finish,
+				 * since it's a part of the SHMEM_RESIZE_DONE phase.
+				 *
+				 * XXX: This is the right place for buffer eviction as well.
+				 */
+				ResizeBufferPool(NBuffersOld, true);
+
+				/* If all fine, broadcast the new value */
+				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+			}
+			else
+				ResizeBufferPool(NBuffersOld, false);
+
+			LWLockRelease(ShmemResizeLock);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * We are asked to resize shared memory. Do the resize and make sure to wait on
+ * the provided barrier until all simultaneously participating backends finish
+ * resizing as well, otherwise we risk inconsistency between
+ * backends.
+ *
+ * XXX: If a backend is blocked on ReadCommand in PostgresMain, it will not
+ * proceed with AnonymousShmemResize after receiving SIGHUP, until something
+ * is sent.
+ */
+bool
+ProcessBarrierShmemResize(Barrier *barrier)
+{
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+
+	/* Wait until we have seen the new NBuffers value */
+	if (!pending_pm_shmem_resize)
+		return false;
+
+	/*
+	 * After attaching to the barrier we could be in any of states:
+	 *
+	 * - Initial SHMEM_RESIZE_REQUESTED, nothing has been done yet
+	 * - SHMEM_RESIZE_START, some of the backends have started to resize
+	 * - SHMEM_RESIZE_DONE, participating backends have finished resizing
+	 * - SHMEM_RESIZE_REQUESTED after the reset, the shared memory was already
+	 *   resized
+	 *
+	 * The first three states take place while the actual resize is in
+	 * progress, and all we need to do is join and proceed with resizing. This
+	 * way all simultaneously participating backends will remap and wait until
+	 * one of them initialize new buffers.
+	 *
+	 * The last state happens when we are too late and everything is already
+	 * done. In that case proceed as well, relying on AnonymousShmemResize not
+	 * to reinitialize anything since NSharedBuffers is already broadcasted.
+	 */
+	BarrierAttach(barrier);
+
+	/* First phase means the resize has begun, SHMEM_RESIZE_START */
+	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);
+
+	/* XXX: Split mremap and buffer reinitialization into two barrier phases */
+	AnonymousShmemResize();
+
+	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
+	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
+
+	/* Allow the last backend to reset the barrier */
+	if (BarrierArriveAndDetach(barrier))
+		ResetShmemBarrier();
+
+	return true;
+}
+
+/*
+ * GUC assign hook for shared_buffers. It's recommended for an assign hook to
+ * be as minimal as possible, thus we just request shared memory resize and
+ * remember the previous value.
+ */
+void
+assign_shared_buffers(int newval, void *extra, bool *pending)
+{
+	elog(DEBUG1, "Received SIGHUP for shmem resizing");
+
+	/* Request shared memory resize only when it was initialized */
+	if (next_free_segment != 0)
+	{
+		elog(DEBUG1, "Set pending signal");
+		pending_pm_shmem_resize = true;
+		*pending = true;
+		NBuffersPending = newval;
+	}
+
+	NBuffersOld = NBuffers;
+}
+
+/*
+ * Test if we have somehow missed a shmem resize signal and the NBuffers value
+ * differs from NSharedBuffers. If so, catch up and do the resize.
+ */
+void
+AdjustShmemSize(void)
+{
+	uint32 NSharedBuffers = pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);
+
+	if (NSharedBuffers != NBuffers)
+	{
+		/*
+		 * If the broadcasted shared_buffers is different from the one we see,
+		 * it could be that the backend has missed a resize signal. To avoid
+		 * any inconsistency, adjust the shared mappings, before having a
+		 * chance to access the buffer pool.
+		 */
+		ereport(LOG,
+				(errmsg("shared_buffers has been changed from %d to %d, "
+						"resize shared memory",
+						NBuffers, NSharedBuffers)));
+		NBuffers = NSharedBuffers;
+		AnonymousShmemResize();
+	}
+}
+
+/*
+ * Coordinate all existing processes to make sure they all will have consistent
+ * view of shared memory size. Must be called only in postmaster.
+ */
+void
+CoordinateShmemResize(void)
+{
+	elog(DEBUG1, "Coordinating shmem resize from %d to %d",
+		 NBuffersOld, NBuffers);
+	Assert(!IsUnderPostmaster);
+
+	/*
+	 * If the value did not change, or shared memory segments are not
+	 * initialized yet, skip the resize.
+	 */
+	if (NBuffersPending == NBuffersOld || next_free_segment == 0)
+	{
+		elog(DEBUG1, "Skip resizing, new %d, old %d, free segment %d",
+			 NBuffers, NBuffersOld, next_free_segment);
+		return;
+	}
+
+	/*
+	 * Shared memory resize requires some coordination done by postmaster,
+	 * and consists of three phases:
+	 *
+	 * - Before the resize all existing backends have the same old NBuffers.
+	 * - When the resize is in progress, backends are expected to have a
+	 *   mixture of old and new values. They're not allowed to touch the
+	 *   buffer pool during this time frame.
+	 * - After the resize has finished, all existing backends that can access
+	 *   the buffer pool are expected to have the same new value of NBuffers.
+	 *   There might still be some backends that are sleeping or for some
+	 *   other reason not doing any work yet and have the old NBuffers -- but
+	 *   as soon as they get a time slice, they will pick up the new
+	 *   value.
+	 */
+	elog(DEBUG1, "Emit a barrier for shmem resizing");
+	EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
+
+	AnonymousShmemResize();
+
+	/*
+	 * Normally we would call WaitForProcSignalBarrier here to wait until every
+	 * backend has reported on the ProcSignalBarrier. But for shared memory
+	 * resize we don't need this, as every participating backend will
+	 * synchronize on the ProcSignal barrier, and there is no sequential logic
+	 * we have to perform afterwards. In fact even if we would like to wait
+	 * here, it wouldn't be possible -- we're in the postmaster, without any
+	 * waiting infrastructure available.
+	 *
+	 * If at some point it turns out that waiting is essential, we would
+	 * need to consider some alternatives. E.g. it could be a designated
+	 * coordination process, which is not the postmaster. Another option would be
+	 * to introduce a CoordinateShmemResize lock and allow only one process to
+	 * take it (this probably would have to be something different than
+	 * LWLocks, since they block interrupts, and coordination relies on them).
+	 */
+}
+
 /*
  * PGSharedMemoryCreate
  *
@@ -1174,3 +1453,24 @@ PGSharedMemoryDetach(void)
 		}
 	}
 }
+
+void
+WaitOnShmemBarrier(int phase)
+{
+	Barrier *barrier = &ShmemCtrl->Barrier;
+
+	if (BarrierPhase(barrier) == phase)
+	{
+		ereport(LOG,
+				(errmsg("ProcSignal barrier is in phase %d, waiting", phase)));
+		BarrierAttach(barrier);
+		BarrierArriveAndWait(barrier, 0);
+		BarrierDetach(barrier);
+	}
+}
+
+void
+ResetShmemBarrier(void)
+{
+	BarrierInit(&ShmemCtrl->Barrier, 0);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index bb22b13adef..f3e508141b2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -418,6 +418,7 @@ static void process_pm_pmsignal(void);
 static void process_pm_child_exit(void);
 static void process_pm_reload_request(void);
 static void process_pm_shutdown_request(void);
+static void process_pm_shmem_resize(void);
 static void dummy_handler(SIGNAL_ARGS);
 static void CleanupBackend(PMChild *bp, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
@@ -1680,6 +1681,9 @@ ServerLoop(void)
 			if (pending_pm_pmsignal)
 				process_pm_pmsignal();
 
+			if (pending_pm_shmem_resize)
+				process_pm_shmem_resize();
+
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
 				ClientSocket s;
@@ -2026,6 +2030,17 @@ process_pm_reload_request(void)
 	}
 }
 
+static void
+process_pm_shmem_resize(void)
+{
+	/*
+	 * Failure to resize is considered to be fatal and will not be
+	 * retried, which means we can disable pending flag right here.
+	 */
+	pending_pm_shmem_resize = false;
+	CoordinateShmemResize();
+}
+
 /*
  * pg_ctl uses SIGTERM, SIGINT and SIGQUIT to request different types of
  * shutdown.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index f5b9290a640..b7de0ab6b0d 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -23,6 +23,41 @@ ConditionVariableMinimallyPadded *BufferIOCVArray;
 WritebackContext BackendWritebackContext;
 CkptSortItem *CkptBufferIds;
 
+/*
+ * Currently broadcasted value of NBuffers in shared memory.
+ *
+ * Most of the time this value is going to be equal to NBuffers. But if the
+ * postmaster is resizing shared memory and a new backend is created at the
+ * same time, the new backend may inherit the old NBuffers value but miss the
+ * resize signal if the ProcSignal infrastructure was not yet initialized.
+ * Consider this situation:
+ *
+ *     Postmaster ------> New Backend
+ *         |                   |
+ *         |                Launch
+ *         |                   |
+ *         |             Inherit NBuffers
+ *         |                   |
+ *     Resize NBuffers         |
+ *         |                   |
+ *     Emit Barrier            |
+ *         |            Init ProcSignal
+ *         |                   |
+ *     Finish resize           |
+ *         |                   |
+ *     New NBuffers       Old NBuffers
+ *
+ * In this case the backend is not yet ready to receive a signal from
+ * EmitProcSignalBarrier, and will be ignored. The same happens if ProcSignal
+ * is initialized even later, after the resize has finished.
+ *
+ * To address the resulting inconsistency, the postmaster broadcasts the
+ * current NBuffers value via shared memory. Every new backend has to verify
+ * this value before it accesses the buffer pool: if it differs from its own
+ * value, this indicates a shared memory resize has happened and the backend
+ * has to synchronize with the rest of the pack first.
+ */
+ShmemControl *ShmemCtrl = NULL;
 
 /*
  * Data Structures:
@@ -72,7 +107,19 @@ BufferManagerShmemInit(void)
 	bool		foundBufs,
 				foundDescs,
 				foundIOCV,
-				foundBufCkpt;
+				foundBufCkpt,
+				foundShmemCtrl;
+
+	ShmemCtrl = (ShmemControl *)
+		ShmemInitStruct("Shmem Control", sizeof(ShmemControl),
+						&foundShmemCtrl);
+
+	if (!foundShmemCtrl)
+	{
+		/* Initialize with the currently known value */
+		pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+		BarrierInit(&ShmemCtrl->Barrier, 0);
+	}
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
@@ -153,6 +200,109 @@ BufferManagerShmemInit(void)
 						 &backend_flush_after);
 }
 
+/*
+ * Reinitialize shared memory structures whose size depends on NBuffers. It's
+ * similar to InitBufferPool, but applied only to the buffers in the range
+ * between NBuffersOld and NBuffers.
+ *
+ * NBuffersOld tells what the original value of NBuffers was. It will be
+ * used to identify new and not yet initialized buffers.
+ *
+ * The initNew flag indicates that the caller wants new buffers to be
+ * initialized. No locks are taken in this function; it is the caller's
+ * responsibility to make sure only one backend works with the new buffers.
+ */
+void
+ResizeBufferPool(int NBuffersOld, bool initNew)
+{
+	bool		foundBufs,
+				foundDescs,
+				foundIOCV,
+				foundBufCkpt;
+	int			i;
+	elog(DEBUG1, "Resizing buffer pool from %d to %d", NBuffersOld, NBuffers);
+
+	/* XXX: Only increasing of shared_buffers is supported in this function */
+	if(NBuffersOld > NBuffers)
+		return;
+
+	/* Align descriptors to a cacheline boundary. */
+	BufferDescriptors = (BufferDescPadded *)
+		ShmemInitStructInSegment("Buffer Descriptors",
+						NBuffers * sizeof(BufferDescPadded),
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
+
+	/* Align condition variables to cacheline boundary. */
+	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
+						NBuffers * sizeof(ConditionVariableMinimallyPadded),
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
+
+	/*
+	 * The array used to sort to-be-checkpointed buffer ids is located in
+	 * shared memory, to avoid having to allocate significant amounts of
+	 * memory at runtime. As that'd be in the middle of a checkpoint, or when
+	 * the checkpointer is restarted, memory allocation failures would be
+	 * painful.
+	 */
+	CkptBufferIds = (CkptSortItem *)
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
+
+	/* Align buffer pool on IO page size boundary. */
+	BufferBlocks = (char *)
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemInitStructInSegment("Buffer Blocks",
+								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
+
+	/*
+	 * It's enough to only resize the shmem structures if some other backend
+	 * will initialize the new buffers for us.
+	 */
+	if (!initNew)
+		return;
+
+	elog(DEBUG1, "Initialize new buffers");
+
+	/*
+	 * Initialize the headers for new buffers.
+	 */
+	for (i = NBuffersOld; i < NBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
+
+		ClearBufferTag(&buf->tag);
+
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+
+		buf->buf_id = i;
+
+		/*
+		 * Initially link all the buffers together as unused. Subsequent
+		 * management of this list is done by freelist.c.
+		 */
+		buf->freeNext = i + 1;
+
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
+
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
+	}
+
+	/* Correct last entry of linked list */
+	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+
+	/* Init other shared buffer-management stuff */
+	StrategyInitialize(!foundDescs);
+
+	/* Initialize per-backend file flush context */
+	WritebackContextInit(&BackendWritebackContext,
+						 &backend_flush_after);
+}
+
 /*
  * BufferManagerShmemSize
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 68778522591..a2c635f288e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -83,6 +83,9 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ *
+ * XXX: Calculations for non-main shared memory segments are incorrect; they
+ * include more than what is needed for buffers only.
  */
 Size
 CalculateShmemSize(int *num_semaphores, int shmem_segment)
@@ -149,6 +152,14 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how slots are split, or was there a hidden
+	 * inconsistency in the shmem calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7401b6e625e..bec0e00f901 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/pg_shmem.h"
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
@@ -108,6 +109,10 @@ static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
 
+#ifdef DEBUG_SHMEM_RESIZE
+bool delay_proc_signal_init = false;
+#endif
+
 /*
  * ProcSignalShmemSize
  *		Compute space needed for ProcSignal's shared memory
@@ -168,6 +173,42 @@ ProcSignalInit(bool cancel_key_valid, int32 cancel_key)
 	ProcSignalSlot *slot;
 	uint64		barrier_generation;
 
+#ifdef DEBUG_SHMEM_RESIZE
+	/*
+	 * Introduced for debugging purposes. You can change the variable at
+	 * runtime using gdb, then start new backends with delayed ProcSignal
+	 * initialization. A simple pg_usleep won't work here due to the SIGHUP
+	 * interrupt needed for testing. Taken from pg_sleep.
+	 */
+	if (delay_proc_signal_init)
+	{
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+		float8		endtime = GetNowFloat() + 5;
+
+		for (;;)
+		{
+			float8		delay;
+			long		delay_ms;
+
+			CHECK_FOR_INTERRUPTS();
+
+			delay = endtime - GetNowFloat();
+			if (delay >= 600.0)
+				delay_ms = 600000;
+			else if (delay > 0.0)
+				delay_ms = (long) (delay * 1000.0);
+			else
+				break;
+
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 delay_ms,
+							 WAIT_EVENT_PG_SLEEP);
+			ResetLatch(MyLatch);
+		}
+	}
+#endif
+
 	if (MyProcNumber < 0)
 		elog(ERROR, "MyProcNumber not set");
 	if (MyProcNumber >= NumProcSignalSlots)
@@ -573,6 +614,10 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
+					case PROCSIGNAL_BARRIER_SHMEM_RESIZE:
+						processed = ProcessBarrierShmemResize(
+								&ShmemCtrl->Barrier);
+						break;
 				}
 
 				/*
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 389abc82519..226b38ba979 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -493,17 +493,13 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. Verify the structure's size:
+		 * - If it's the same, we've found the expected structure.
+		 * - If it's different, we're resizing the expected structure.
 		 */
 		if (result->size != size)
-		{
-			LWLockRelease(ShmemIndexLock);
-			ereport(ERROR,
-					(errmsg("ShmemIndex entry size is wrong for data structure"
-							" \"%s\": expected %zu, actual %zu",
-							name, size, result->size)));
-		}
+			result->size = size;
+
 		structPtr = result->location;
 	}
 	else
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 13fb8c31702..04cdd0d24d8 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -62,6 +62,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -4267,6 +4268,20 @@ PostgresMain(const char *dbname, const char *username)
 	 */
 	BeginReportingGUCOptions();
 
+	/*
+	 * Verify the shared barrier, if it's still active: join and wait.
+	 *
+	 * XXX: Is there a potential race condition if not a single backend has
+	 * incremented the barrier phase yet?
+	 */
+	WaitOnShmemBarrier(SHMEM_RESIZE_START);
+
+	/*
+	 * After waiting on the barrier above we are guaranteed to have NSharedBuffers
+	 * broadcasted, so we can use it in the function below.
+	 */
+	AdjustShmemSize();
+
 	/*
 	 * Also set up handler to log session end; we have to wait till now to be
 	 * sure Log_disconnections has its final value.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..012acb98169 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -154,6 +154,8 @@ REPLICATION_ORIGIN_DROP	"Waiting for a replication origin to become inactive so
 REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
+SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
 WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for streaming replication."
@@ -346,6 +348,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 3cde94a1759..efdaa71c8fb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2339,14 +2339,14 @@ struct config_int ConfigureNamesInt[] =
 	 * checking for overflow, so we mustn't allow more than INT_MAX / 2.
 	 */
 	{
-		{"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+		{"shared_buffers", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the number of shared memory buffers used by the server."),
 			NULL,
 			GUC_UNIT_BLOCKS
 		},
 		&NBuffers,
 		16384, 16, INT_MAX / 2,
-		NULL, NULL, NULL
+		NULL, assign_shared_buffers, NULL
 	},
 
 	{
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index bb7fe02e243..fff80214822 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -298,6 +298,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
 extern Size BufferManagerShmemSize(int);
+extern void ResizeBufferPool(int, bool);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index c0439f2206b..5f5b45c88bd 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -64,6 +64,7 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
+extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
 
 extern void proc_exit(int code) pg_attribute_noreturn();
 extern void shmem_exit(int code);
@@ -83,5 +84,6 @@ extern void CreateSharedMemoryAndSemaphores(void);
 extern void AttachSharedMemoryStructs(void);
 #endif
 extern void InitializeShmemGUCs(void);
+extern void CoordinateShmemResize(void);
 
 #endif							/* IPC_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..61e89c6e8fd 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ShmemResize)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index ba0192baf95..b597df0d3a3 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,6 +24,7 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "storage/barrier.h"
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
@@ -56,6 +57,23 @@ typedef struct ShmemSegment
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
+/*
+ * ShmemControl is shared between backends and helps to coordinate shared
+ * memory resize.
+ */
+typedef struct
+{
+	pg_atomic_uint32 	NSharedBuffers;
+	Barrier 			Barrier;
+} ShmemControl;
+
+extern PGDLLIMPORT ShmemControl *ShmemCtrl;
+
+/* The phases for shared memory resizing, used for the ProcSignal barrier. */
+#define SHMEM_RESIZE_REQUESTED			0
+#define SHMEM_RESIZE_START				1
+#define SHMEM_RESIZE_DONE				2
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -105,6 +123,12 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+bool ProcessBarrierShmemResize(Barrier *barrier);
+void assign_shared_buffers(int newval, void *extra, bool *pending);
+void AdjustShmemSize(void);
+extern void WaitOnShmemBarrier(int phase);
+extern void ResetShmemBarrier(void);
+
 /*
  * To be able to dynamically resize largest parts of the data stored in shared
  * memory, we split it into multiple shared memory mappings segments. Each
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 022fd8ed933..4c9973dc2d9 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,6 +54,7 @@ typedef enum
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
+	PROCSIGNAL_BARRIER_SHMEM_RESIZE, /* ask backends to resize shared memory */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fb39c915d76..5bf6d099808 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2671,6 +2671,7 @@ ShellTypeInfo
 ShippableCacheEntry
 ShippableCacheKey
 ShmemIndexEnt
+ShmemControl
 ShutdownForeignScan_function
 ShutdownInformation
 ShutdownMode
-- 
2.45.1

v2-0006-Use-anonymous-files-to-back-shared-memory-segment.patch
From e511bab55891a2d60152e913df6c20e20314e71b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 23 Feb 2025 14:42:39 +0100
Subject: [PATCH v2 6/6] Use anonymous files to back shared memory segments

Allow using anonymous files for shared memory, instead of plain
anonymous memory. Such an anonymous file is created via memfd_create; it
lives in memory, behaves like a regular file and is semantically
equivalent to anonymous memory allocated via mmap with MAP_ANONYMOUS.

The advantages of using anonymous files are the following:

* We get a file descriptor, which can be used for regular file
  operations (modification, truncation, you name it).

* The file can be given a name, which improves readability when it
  comes to process maps. Here is how it looks:

7f5a2bd04000-7f5a32e52000 rw-s 00000000 00:01 1845 /memfd:strategy (deleted)
7f5a39252000-7f5a4030e000 rw-s 00000000 00:01 1842 /memfd:checkpoint (deleted)
7f5a4670e000-7f5a4d7ba000 rw-s 00000000 00:01 1839 /memfd:iocv (deleted)
7f5a53bba000-7f5a5ad26000 rw-s 00000000 00:01 1836 /memfd:descriptors (deleted)
7f5a9ad26000-7f5aa9d94000 rw-s 00000000 00:01 1833 /memfd:buffers (deleted)
7f5d29d94000-7f5d30e00000 rw-s 00000000 00:01 1830 /memfd:main (deleted)

* By default, Linux will not add file-backed shared mappings into a core dump,
  making it more convenient to work with them in PostgreSQL: no more huge dumps
  to process.

The downside is that memfd_create is Linux specific.
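
For illustration only, here is a minimal standalone sketch (not from the
patch) of the memfd_create + ftruncate + mmap sequence the patch uses;
the segment name and size are made up. It is Linux-only, since
memfd_create needs a reasonably recent kernel and glibc.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t size = 16UL * 1024 * 1024;

        /* Anonymous in-memory file; the name only shows up in
         * /proc/<pid>/maps as /memfd:buffers. */
        int fd = memfd_create("buffers", 0);
        if (fd < 0) { perror("memfd_create"); return 1; }

        /* Size the backing file, then map it shared. */
        if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        printf("mapped %zu bytes at %p backed by fd %d\n", size, p, fd);
        return 0;
    }

Because the mapping is backed by a file descriptor, growing it later is a
matter of ftruncate on the fd plus mremap on the mapping, which is what
the resize path does once this patch adds the ftruncate call.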
---
 src/backend/port/sysv_shmem.c | 46 +++++++++++++++++++++++++++++------
 src/include/portability/mem.h |  2 +-
 2 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 35a8ff92175..8864866f26c 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -105,6 +105,7 @@ typedef struct AnonymousMapping
 	void *shmem; 				/* Pointer to the start of the mapped memory */
 	void *seg_addr; 			/* SysV shared memory for the header */
 	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -125,7 +126,7 @@ static int next_free_segment = 0;
  * 00400000-00490000         /path/bin/postgres
  * ...
  * 012d9000-0133e000         [heap]
- * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f443a800000-7f470a800000 /memfd:main (deleted)
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
@@ -152,9 +153,9 @@ static int next_free_segment = 0;
  * The result would look like this:
  *
  * 012d9000-0133e000         [heap]
- * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f4426f54000-7f442e010000 /memfd:main (deleted)
  * [...free space...]
- * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f443a800000-7f444196c000 /memfd:buffers (deleted)
  * [...free space...]
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
@@ -717,6 +718,18 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
+	/*
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped,  it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, thus it has
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
+	 */
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_segment), 0);
+
 #ifndef MAP_HUGETLB
 	/* PGSharedMemoryCreate should have dealt with this case */
 	Assert(huge_pages != HUGE_PAGES_ON);
@@ -734,8 +747,13 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
+		/*
+		 * Do not use an anonymous file here yet. When adding it, do not forget
+		 * to use ftruncate and flags MFD_HUGETLB & MFD_HUGE_2MB/MFD_HUGE_1GB
+		 * in memfd_create.
+		 */
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
 		{
@@ -771,7 +789,8 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 * - First create the temporary probe mapping of a fixed size and let
 		 *   kernel to place it at address of its choice. By the virtue of the
 		 *   probe mapping size we expect it to be located at the lowest
-		 *   possible address, expecting some non mapped space above.
+		 *   possible address, expecting some non mapped space above. The probe
+		 *   does not need to be backed by an anonymous file.
 		 *
 		 * - Unmap the probe mapping, remember the address.
 		 *
@@ -786,7 +805,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 *   without a restart.
 		 */
 		probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS, -1, 0);
 
 		if (probe == MAP_FAILED)
 		{
@@ -802,8 +821,14 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
 			munmap(probe, PROBE_MAPPING_SIZE);
 
+			/*
+			 * Specify the segment file size using allocsize, which contains
+			 * potentially modified size.
+			 * the potentially modified size.
+			ftruncate(mapping->segment_fd, allocsize);
+
 			ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE,
-					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, mapping->segment_fd, 0);
 			mmap_errno = errno;
 			if (ptr == MAP_FAILED)
 			{
@@ -822,8 +847,11 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 */
 		allocsize = mapping->shmem_size;
 
+		/* Specify the segment file size using allocsize. */
+		ftruncate(mapping->segment_fd, allocsize);
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-						   PG_MMAP_FLAGS, -1, 0);
+						   PG_MMAP_FLAGS, mapping->segment_fd, 0);
 		mmap_errno = errno;
 	}
 
@@ -917,6 +945,8 @@ AnonymousShmemResize(void)
 		if (m->shmem_size == new_size)
 			continue;
 
+		/* Resize the backing anon file. */
+		ftruncate(m->segment_fd, new_size);
 
 		/*
 		 * Fail hard if faced any issues. In theory we could try to handle this
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index ef9800732d9..40588ff6968 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
-- 
2.45.1

#34Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#33)
Re: Changing shared_buffers without restart

On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:

On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal about how to
coordinate shared memory resizing between backends. The rest is more or less
the same, a feedback about coordination is appreciated. It's a lot to read, but
the main difference is about:

Just one note, there are still a couple of compilation warnings in the
code, which I haven't addressed yet. Those will go away in the next
version.

#35Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#34)
11 attachment(s)
Re: Changing shared_buffers without restart

On Thu, Feb 27, 2025 at 1:58 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:

On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal about how to
coordinate shared memory resizing between backends. The rest is more or less
the same, a feedback about coordination is appreciated. It's a lot to read, but
the main difference is about:

Just one note, there are still a couple of compilation warnings in the
code, which I haven't addressed yet. Those will go away in the next
version.

PFA the patchset which implements shrinking shared buffers.
0001-0006 are the same as in the previous patchset
0007 fixes compilation warnings from previous patches - I think those
should be absorbed into their respective patches
0008 adds TODOs that need some code changes or at least need some
consideration. Some of them might point to the causes of Assertion
failures seen with this patch set.
0009 adds WIP support for shrinking shared buffers - I think this
should be absorbed into 0005
0010 WIP fix for Assertion failures seen from BgBufferSync() - I am
still investigating those.

I am using the attached script to exercise the patch thoroughly. It runs
pgbench and concurrently resizes shared_buffers. I am seeing
assertion failures when running the script in both cases, expanding
and shrinking the buffers. I am investigating the failed
Assert("strategy_delta >= 0") next.

--
Best Wishes,
Ashutosh Bapat

Attachments:

0004-Introduce-pending-flag-for-GUC-assign-hooks-20250228.patch
From 7de04680820a8aa3de7b13e4a0f33471a51524fe Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 19 Feb 2025 17:45:40 +0100
Subject: [PATCH 04/11] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinated work
between backends to change, meaning we cannot apply the new value right
away.

Add a new flag "pending" to allow an assign hook to indicate exactly
that. If the pending flag is set after the hook, the new value will not
be applied and its handling becomes the responsibility of the hook's
implementation.

Note that this also requires changes in the way GUCs are reported, but
the patch does not cover that yet.
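
To illustrate the intended contract, here is a hedged sketch (not from
the patch) of an assign hook that uses the new out-parameter; the hook
name and the static variables are hypothetical stand-ins.

    #include <stdbool.h>

    static int  deferred_newval = -1;    /* value to apply later */
    static bool apply_deferred = false;  /* polled by the owning subsystem */

    static void
    assign_hypothetical_guc(int newval, void *extra, bool *pending)
    {
        (void) extra;

        /* Suppose applying newval needs coordination with other backends:
         * remember it and tell guc.c not to write the backing variable.
         * The hook's own machinery applies the value once coordination is
         * done (compare assign_shared_buffers later in the series). */
        deferred_newval = newval;
        apply_deferred = true;
        *pending = true;
    }

When *pending is left false, guc.c behaves exactly as before and assigns
the new value right after the hook returns.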
---
 src/backend/access/transam/xlog.c    |  2 +-
 src/backend/commands/variable.c      |  2 +-
 src/backend/libpq/pqcomm.c           |  8 ++--
 src/backend/tcop/postgres.c          |  2 +-
 src/backend/utils/misc/guc.c         | 59 +++++++++++++++++++---------
 src/backend/utils/misc/stack_depth.c |  2 +-
 src/include/utils/guc.h              |  2 +-
 src/include/utils/guc_hooks.h        | 16 ++++----
 8 files changed, 57 insertions(+), 36 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 75d5554c77c..cf79df49cb7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2267,7 +2267,7 @@ CalculateCheckpointSegments(void)
 }
 
 void
-assign_max_wal_size(int newval, void *extra)
+assign_max_wal_size(int newval, void *extra, bool *pending)
 {
 	max_wal_size_mb = newval;
 	CalculateCheckpointSegments();
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 4ad6e236d69..f24c2a0d252 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1143,7 +1143,7 @@ check_cluster_name(char **newval, void **extra, GucSource source)
  * GUC assign_hook for maintenance_io_concurrency
  */
 void
-assign_maintenance_io_concurrency(int newval, void *extra)
+assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
 {
 #ifdef USE_PREFETCH
 	/*
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index bddd6465de2..3e6d83c8775 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1949,7 +1949,7 @@ pq_settcpusertimeout(int timeout, Port *port)
  * GUC assign_hook for tcp_keepalives_idle
  */
 void
-assign_tcp_keepalives_idle(int newval, void *extra)
+assign_tcp_keepalives_idle(int newval, void *extra, bool *pending)
 {
 	/*
 	 * The kernel API provides no way to test a value without setting it; and
@@ -1982,7 +1982,7 @@ show_tcp_keepalives_idle(void)
  * GUC assign_hook for tcp_keepalives_interval
  */
 void
-assign_tcp_keepalives_interval(int newval, void *extra)
+assign_tcp_keepalives_interval(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivesinterval(newval, MyProcPort);
@@ -2005,7 +2005,7 @@ show_tcp_keepalives_interval(void)
  * GUC assign_hook for tcp_keepalives_count
  */
 void
-assign_tcp_keepalives_count(int newval, void *extra)
+assign_tcp_keepalives_count(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivescount(newval, MyProcPort);
@@ -2028,7 +2028,7 @@ show_tcp_keepalives_count(void)
  * GUC assign_hook for tcp_user_timeout
  */
 void
-assign_tcp_user_timeout(int newval, void *extra)
+assign_tcp_user_timeout(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_settcpusertimeout(newval, MyProcPort);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 1149d89d7a1..13fb8c31702 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3555,7 +3555,7 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 
 /* GUC assign hook for transaction_timeout */
 void
-assign_transaction_timeout(int newval, void *extra)
+assign_transaction_timeout(int newval, void *extra, bool *pending)
 {
 	if (IsTransactionState())
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 12192445218..bab1c5d08f6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1679,6 +1679,7 @@ InitializeOneGUCOption(struct config_generic *gconf)
 				struct config_int *conf = (struct config_int *) gconf;
 				int			newval = conf->boot_val;
 				void	   *extra = NULL;
+				bool 	   pending = false;
 
 				Assert(newval >= conf->min);
 				Assert(newval <= conf->max);
@@ -1687,9 +1688,13 @@ InitializeOneGUCOption(struct config_generic *gconf)
 					elog(FATAL, "failed to initialize %s to %d",
 						 conf->gen.name, newval);
 				if (conf->assign_hook)
-					conf->assign_hook(newval, extra);
-				*conf->variable = conf->reset_val = newval;
-				conf->gen.extra = conf->reset_extra = extra;
+					conf->assign_hook(newval, extra, &pending);
+
+				if (!pending)
+				{
+					*conf->variable = conf->reset_val = newval;
+					conf->gen.extra = conf->reset_extra = extra;
+				}
 				break;
 			}
 		case PGC_REAL:
@@ -2041,13 +2046,18 @@ ResetAllOptions(void)
 			case PGC_INT:
 				{
 					struct config_int *conf = (struct config_int *) gconf;
+					bool 			  pending = false;
 
 					if (conf->assign_hook)
 						conf->assign_hook(conf->reset_val,
-										  conf->reset_extra);
-					*conf->variable = conf->reset_val;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									conf->reset_extra);
+										  conf->reset_extra,
+										  &pending);
+					if (!pending)
+					{
+						*conf->variable = conf->reset_val;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										conf->reset_extra);
+					}
 					break;
 				}
 			case PGC_REAL:
@@ -2424,16 +2434,21 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
 							struct config_int *conf = (struct config_int *) gconf;
 							int			newval = newvalue.val.intval;
 							void	   *newextra = newvalue.extra;
+							bool 	    pending = false;
 
 							if (*conf->variable != newval ||
 								conf->gen.extra != newextra)
 							{
 								if (conf->assign_hook)
-									conf->assign_hook(newval, newextra);
-								*conf->variable = newval;
-								set_extra_field(&conf->gen, &conf->gen.extra,
-												newextra);
-								changed = true;
+									conf->assign_hook(newval, newextra, &pending);
+
+								if (!pending)
+								{
+									*conf->variable = newval;
+									set_extra_field(&conf->gen, &conf->gen.extra,
+													newextra);
+									changed = true;
+								}
 							}
 							break;
 						}
@@ -3850,18 +3865,24 @@ set_config_with_handle(const char *name, config_handle *handle,
 
 				if (changeVal)
 				{
+					bool pending = false;
+
 					/* Save old value to support transaction abort */
 					if (!makeDefault)
 						push_old_value(&conf->gen, action);
 
 					if (conf->assign_hook)
-						conf->assign_hook(newval, newextra);
-					*conf->variable = newval;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									newextra);
-					set_guc_source(&conf->gen, source);
-					conf->gen.scontext = context;
-					conf->gen.srole = srole;
+						conf->assign_hook(newval, newextra, &pending);
+
+					if (!pending)
+					{
+						*conf->variable = newval;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										newextra);
+						set_guc_source(&conf->gen, source);
+						conf->gen.scontext = context;
+						conf->gen.srole = srole;
+					}
 				}
 				if (makeDefault)
 				{
diff --git a/src/backend/utils/misc/stack_depth.c b/src/backend/utils/misc/stack_depth.c
index 8f7cf531fbc..ef59ae62008 100644
--- a/src/backend/utils/misc/stack_depth.c
+++ b/src/backend/utils/misc/stack_depth.c
@@ -156,7 +156,7 @@ check_max_stack_depth(int *newval, void **extra, GucSource source)
 
 /* GUC assign hook for max_stack_depth */
 void
-assign_max_stack_depth(int newval, void *extra)
+assign_max_stack_depth(int newval, void *extra, bool *pending)
 {
 	ssize_t		newval_bytes = newval * (ssize_t) 1024;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 1233e07d7da..ce9f258100d 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -187,7 +187,7 @@ typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource sourc
 typedef bool (*GucEnumCheckHook) (int *newval, void **extra, GucSource source);
 
 typedef void (*GucBoolAssignHook) (bool newval, void *extra);
-typedef void (*GucIntAssignHook) (int newval, void *extra);
+typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);
 typedef void (*GucRealAssignHook) (double newval, void *extra);
 typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 typedef void (*GucEnumAssignHook) (int newval, void *extra);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 87999218d68..1fb4c519c7f 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -81,12 +81,12 @@ extern void assign_log_timezone(const char *newval, void *extra);
 extern const char *show_log_timezone(void);
 extern bool check_maintenance_io_concurrency(int *newval, void **extra,
 											 GucSource source);
-extern void assign_maintenance_io_concurrency(int newval, void *extra);
+extern void assign_maintenance_io_concurrency(int newval, void *extra, bool *pending);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
 										 GucSource source);
-extern void assign_max_wal_size(int newval, void *extra);
+extern void assign_max_wal_size(int newval, void *extra, bool *pending);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
-extern void assign_max_stack_depth(int newval, void *extra);
+extern void assign_max_stack_depth(int newval, void *extra, bool *pending);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
@@ -141,13 +141,13 @@ extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 extern void assign_syslog_facility(int newval, void *extra);
 extern void assign_syslog_ident(const char *newval, void *extra);
-extern void assign_tcp_keepalives_count(int newval, void *extra);
+extern void assign_tcp_keepalives_count(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_count(void);
-extern void assign_tcp_keepalives_idle(int newval, void *extra);
+extern void assign_tcp_keepalives_idle(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_idle(void);
-extern void assign_tcp_keepalives_interval(int newval, void *extra);
+extern void assign_tcp_keepalives_interval(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_interval(void);
-extern void assign_tcp_user_timeout(int newval, void *extra);
+extern void assign_tcp_user_timeout(int newval, void *extra, bool *pending);
 extern const char *show_tcp_user_timeout(void);
 extern bool check_temp_buffers(int *newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra,
@@ -163,7 +163,7 @@ extern bool check_transaction_buffers(int *newval, void **extra, GucSource sourc
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
 extern bool check_transaction_read_only(bool *newval, void **extra, GucSource source);
-extern void assign_transaction_timeout(int newval, void *extra);
+extern void assign_transaction_timeout(int newval, void *extra, bool *pending);
 extern const char *show_unix_socket_permissions(void);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern bool check_wal_consistency_checking(char **newval, void **extra,
-- 
2.34.1

0003-Introduce-multiple-shmem-segments-for-share-20250228.patch (text/x-patch)
From df9ed8581fbe8cf4cf1f6bb53e5ced94c7bfe19d Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Mon, 24 Feb 2025 20:08:28 +0100
Subject: [PATCH 03/11] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into the following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers,
meaning that if we want to change NBuffers, these structures have to be
resized correspondingly. Placing each of them in a separate shmem
segment makes that possible.

There are some assumptions made about the upper size limit of each shmem
segment. The buffer blocks get the largest reserve, while the rest claim
less extra room for resizing. Ideally those limits should be deduced from
the maximum allowed shared memory.
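
One possible direction (not implemented in this patch set) is to derive the
per-segment reserve from an explicit cap on NBuffers instead of hard-coding
it. A rough sketch, where max_nbuffers is a hypothetical upper limit:

    /*
     * Hedged sketch only: max_nbuffers is a hypothetical parameter that
     * would replace the hard-coded SHMEM_EXTRA_SIZE_LIMIT entries.
     */
    static Size
    BufferSegmentReserve(int shmem_segment, int max_nbuffers)
    {
        switch (shmem_segment)
        {
            case BUFFERS_SHMEM_SEGMENT:
                return mul_size(max_nbuffers, BLCKSZ);
            case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
                return mul_size(max_nbuffers, sizeof(BufferDescPadded));
            case BUFFER_IOCV_SHMEM_SEGMENT:
                return mul_size(max_nbuffers,
                                sizeof(ConditionVariableMinimallyPadded));
            case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
                return mul_size(max_nbuffers, sizeof(CkptSortItem));
            default:
                /* main and strategy segments keep their fixed reserve */
                return 0;
        }
    }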
---
 src/backend/port/sysv_shmem.c          | 19 ++++++-
 src/backend/storage/buffer/buf_init.c  | 79 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  5 +-
 src/backend/storage/buffer/freelist.c  |  4 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 24 +++++++-
 7 files changed, 99 insertions(+), 36 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 62f01d8218a..59aa67cb135 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -149,8 +149,13 @@ static int next_free_segment = 0;
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
  */
-Size SHMEM_EXTRA_SIZE_LIMIT[1] = {
-	0, 									/* MAIN_SHMEM_SLOT */
+Size SHMEM_EXTRA_SIZE_LIMIT[6] = {
+	0, 									/* MAIN_SHMEM_SEGMENT */
+	(Size) 1024 * 1024 * 1024 * 10, 	/* BUFFERS_SHMEM_SEGMENT */
+	(Size) 1024 * 1024 * 1024 * 1, 		/* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
+	(Size) 1024 * 1024 * 100, 			/* BUFFER_IOCV_SHMEM_SEGMENT */
+	(Size) 1024 * 1024 * 100, 			/* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
+	(Size) 1024 * 1024 * 100, 			/* STRATEGY_SHMEM_SEGMENT */
 };
 
 /* Remembers offset of the last mapping from the probe address */
@@ -179,6 +184,16 @@ MappingName(int shmem_segment)
 	{
 		case MAIN_SHMEM_SEGMENT:
 			return "main";
+		case BUFFERS_SHMEM_SEGMENT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SEGMENT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SEGMENT:
+			return "strategy";
 		default:
 			return "unknown";
 	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1f8e03190..f5b9290a640 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -61,7 +61,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). The size of the data structures
+ * initialized here depends on NBuffers, so to be able to change NBuffers
+ * without a restart each structure is stored in a separate shared memory
+ * segment, which can be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -73,22 +76,22 @@ BufferManagerShmemInit(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSegment("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSegment("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -98,8 +101,9 @@ BufferManagerShmemInit(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -153,33 +157,54 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc., for the given
+ * shared memory segment. The main segment does not allocate anything
+ * related to buffers; every other segment receives its part of the
+ * data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(int shmem_segment)
 {
 	Size		size = 0;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == MAIN_SHMEM_SEGMENT)
+		return size;
 
-	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
+	{
+		/* size of buffer descriptors */
+		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+		/* to allow aligning buffer descriptors */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of data pages, plus alignment padding */
+		size = add_size(size, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	}
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
-								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
+	{
+		/* size of stuff controlled by freelist.c */
+		size = add_size(size, StrategyShmemSize());
+	}
 
-	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
+	{
+		/* size of I/O condition variables */
+		size = add_size(size, mul_size(NBuffers,
+									   sizeof(ConditionVariableMinimallyPadded)));
+		/* to allow aligning the above */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
+
+	if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of checkpoint sort array in bufmgr.c */
+		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	}
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index a50955d5286..ac449954dab 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -59,10 +59,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION,
+								  STRATEGY_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 336715b6c63..4919a92f2be 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -491,9 +491,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SEGMENT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 4f6c707c204..68778522591 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -112,7 +112,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize());
+	size = add_size(size, BufferManagerShmemSize(shmem_segment));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7c1e4316dde..bb7fe02e243 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -297,7 +297,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 138078c29c5..ba0192baf95 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -105,7 +105,29 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/*
+ * To be able to dynamically resize the largest parts of the data stored in
+ * shared memory, we split it into multiple shared memory segments. Each
+ * segment contains only a certain part of the data, whose size depends on
+ * NBuffers.
+ */
+
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SEGMENT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SEGMENT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SEGMENT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SEGMENT 5
+
 #endif							/* PG_SHMEM_H */
-- 
2.34.1

0005-Allow-to-resize-shared-memory-without-resta-20250228.patch (text/x-patch)
From 71b4321af793a8a5682d5f9b1490fb3718e74759 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Thu, 20 Feb 2025 21:12:26 +0100
Subject: [PATCH 05/11] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using the
space introduced in the previous commits, without requiring a PostgreSQL
restart. Essentially the implementation is based on two mechanisms: a
global Barrier to coordinate backends that simultaneously change
shared_buffers, and state in shared memory to coordinate backends that
are too late to the party for some reason.

The resize process looks like this:

* The GUC assign hook sets a flag to let the Postmaster know that a resize
  was requested.

* The Postmaster checks the flag in its event loop and starts the resize by
  emitting a ProcSignal barrier. Afterwards it resizes its own shared
  memory mappings.

* All backends that participate in the ProcSignal mechanism recalculate the
  shared memory size based on the new NBuffers and extend their mappings
  using mremap.

* When finished, a backend waits on a global ShmemControl barrier until all
  other backends have finished as well. This way we ensure three stages
  with clear boundaries: before the resize, when all processes use the old
  NBuffers; during the resize, when processes have a mix of old and new
  NBuffers and wait until it's done; after the resize, when all processes
  use the new NBuffers.

* After all backends are using the new value, one backend initializes the
  new shared structures (buffer blocks, descriptors, etc.) as needed and
  broadcasts the new value of NBuffers via ShmemControl in shared memory.
  Other backends wait for this operation to finish as well. Then the
  barrier is lifted and everything goes on as usual (see the condensed
  sketch below).
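
Condensed, the per-backend flow roughly looks like this (a sketch only;
error handling and the late-join cases handled by ProcessBarrierShmemResize
are left out):

    BarrierAttach(&ShmemCtrl->Barrier);

    /* SHMEM_RESIZE_START: every participant has seen the new value */
    BarrierArriveAndWait(&ShmemCtrl->Barrier, WAIT_EVENT_SHMEM_RESIZE_START);

    /* remap this backend's segments; one backend also inits new buffers */
    AnonymousShmemResize();

    /* SHMEM_RESIZE_DONE: remapping and reinitialization have finished */
    BarrierArriveAndWait(&ShmemCtrl->Barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);

    /* the last backend to detach resets the barrier for the next resize */
    if (BarrierArriveAndDetach(&ShmemCtrl->Barrier))
        ResetShmemBarrier();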

Here is how it looks after raising shared_buffers from 128 MB to
512 MB and calling pg_reload_conf():

    -- 128 MB
    7f5a2bd04000-7f5a32e52000  /dev/zero (deleted)
    7f5a39252000-7f5a4030e000  /dev/zero (deleted)
    7f5a4670e000-7f5a4d7ba000  /dev/zero (deleted)
    7f5a53bba000-7f5a5ad26000  /dev/zero (deleted)
    7f5a9ad26000-7f5aa9d94000  /dev/zero (deleted)
    ^ buffers mapping, ~240 MB
    7f5d29d94000-7f5d30e00000  /dev/zero (deleted)

    -- 512 MB
    7f5a2bd04000-7f5a33274000  /dev/zero (deleted)
    7f5a39252000-7f5a4057e000  /dev/zero (deleted)
    7f5a4670e000-7f5a4d9fa000  /dev/zero (deleted)
    7f5a53bba000-7f5a5b1a6000  /dev/zero (deleted)
    7f5a9ad26000-7f5ac1f14000  /dev/zero (deleted)
    ^ buffers mapping, ~625 MB
    7f5d29d94000-7f5d30f80000  /dev/zero (deleted)

The implementation supports only increasing shared_buffers. Decreasing
the value would require a similar procedure, but the buffer blocks
containing data have to be drained first, so that the actual data set
fits into the new, smaller space.

Experiments show that shared mappings have to be extended separately
in each process that uses them. Another rough edge is that a backend
blocked on ReadCommand will not apply a shared_buffers change until it
reads something.

Note that mremap is Linux-specific, so the implementation is not very
portable.

Authors: Dmitrii Dolgov, Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 300 ++++++++++++++++++
 src/backend/postmaster/postmaster.c           |  15 +
 src/backend/storage/buffer/buf_init.c         | 152 ++++++++-
 src/backend/storage/ipc/ipci.c                |  11 +
 src/backend/storage/ipc/procsignal.c          |  45 +++
 src/backend/storage/ipc/shmem.c               |  14 +-
 src/backend/tcop/postgres.c                   |  15 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/backend/utils/misc/guc_tables.c           |   4 +-
 src/include/storage/bufmgr.h                  |   1 +
 src/include/storage/ipc.h                     |   2 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pg_shmem.h                |  24 ++
 src/include/storage/procsignal.h              |   1 +
 src/tools/pgindent/typedefs.list              |   1 +
 15 files changed, 577 insertions(+), 12 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 59aa67cb135..35a8ff92175 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,13 +30,17 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/pidfile.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -105,6 +109,13 @@ typedef struct AnonymousMapping
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/* Flag telling postmaster that resize is needed */
+volatile bool pending_pm_shmem_resize = false;
+
+/* Keeps track of the previous NBuffers value */
+static int NBuffersOld = -1;
+static int NBuffersPending = -1;
+
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
@@ -859,6 +870,274 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * Resize all shared memory segments based on the current NBuffers value, which
+ * is applied from NBuffersPending. The actual segment resizing is done via
+ * mremap, which will fail if there is not enough space to expand the mapping.
+ * When finished, initialize any new buffer blocks based on the new and old
+ * values.
+ *
+ * If reinitialization took place, as the last step this function broadcasts
+ * the new value of NSharedBuffers, allowing any other backends to rely on
+ * this new value and skip buffer reinitialization.
+ */
+static bool
+AnonymousShmemResize(void)
+{
+	int	numSemas;
+	bool reinit = false;
+	NBuffers = NBuffersPending;
+
+	elog(DEBUG1, "Resize shmem from %d to %d", NBuffersOld, NBuffers);
+
+	/*
+	 * XXX: Where to reset the flag is still an open question. E.g. do we
+	 * consider a no-op when NBuffers is equal to NBuffersOld a genuine resize
+	 * and reset the flag?
+	 */
+	pending_pm_shmem_resize = false;
+
+	/*
+	 * XXX: Currently only increasing of shared_buffers is supported. For
+	 * decreasing something similar has to be done, but buffer blocks with
+	 * data have to be drained first.
+	 */
+	if(NBuffersOld > NBuffers)
+		return false;
+
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		/* Note that CalculateShmemSize indirectly depends on NBuffers */
+		Size new_size = CalculateShmemSize(&numSemas, i);
+		AnonymousMapping *m = &Mappings[i];
+
+		if (m->shmem == NULL)
+			continue;
+
+		if (m->shmem_size == new_size)
+			continue;
+
+
+		/*
+		 * Fail hard if faced any issues. In theory we could try to handle this
+		 * more gracefully and proceed with shared memory as before, but some
+		 * other backends might have succeeded and have different size. If we
+		 * would like to go this way, to be consistent we would need to
+		 * synchronize again, and it's not clear if it's worth the effort.
+		 */
+		if (mremap(m->shmem, m->shmem_size, new_size, 0) < 0)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not resize shared memory %p to %d (%zu): %m",
+							m->shmem, NBuffers, m->shmem_size)));
+		else
+		{
+			reinit = true;
+			m->shmem_size = new_size;
+		}
+	}
+
+	if (reinit)
+	{
+		if(IsUnderPostmaster &&
+			LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+		{
+			/*
+			 * If the new NBuffers was already broadcasted, the buffer pool was
+			 * already initialized before.
+			 *
+			 * Since we're not on a hot path, we use lwlocks and do not need to
+			 * involve memory barrier.
+			 */
+			if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+			{
+				/*
+				 * Allow the first backend that managed to get the lock to
+				 * reinitialize the new portion of buffer pool. Every other
+				 * process will wait on the shared barrier for that to finish,
+				 * since it's a part of the SHMEM_RESIZE_DONE phase.
+				 *
+				 * XXX: This is the right place for buffer eviction as well.
+				 */
+				ResizeBufferPool(NBuffersOld, true);
+
+				/* If all fine, broadcast the new value */
+				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+			}
+			else
+				ResizeBufferPool(NBuffersOld, false);
+
+			LWLockRelease(ShmemResizeLock);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * We are asked to resize shared memory. Do the resize and make sure to wait on
+ * the provided barrier until all simultaneously participating backends finish
+ * resizing as well, otherwise we face danger of inconsistency between
+ * backends.
+ *
+ * XXX: If a backend is blocked on ReadCommand in PostgresMain, it will not
+ * proceed with AnonymousShmemResize after receiving SIGHUP, until something
+ * will be sent.
+ */
+bool
+ProcessBarrierShmemResize(Barrier *barrier)
+{
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+
+	/* Wait until we have seen the new NBuffers value */
+	if (!pending_pm_shmem_resize)
+		return false;
+
+	/*
+	 * After attaching to the barrier we could be in any of states:
+	 *
+	 * - Initial SHMEM_RESIZE_REQUESTED, nothing has been done yet
+	 * - SHMEM_RESIZE_START, some of the backends have started to resize
+	 * - SHMEM_RESIZE_DONE, participating backends have finished resizing
+	 * - SHMEM_RESIZE_REQUESTED after the reset, the shared memory was already
+	 *   resized
+	 *
+	 * The first three states take place while the actual resize is in
+	 * progress, and all we need to do is join and proceed with resizing. This
+	 * way all simultaneously participating backends will remap and wait until
+	 * one of them initialize new buffers.
+	 *
+	 * The last state happens when we are too late and everything is already
+	 * done. In that case proceed as well, relying on AnonymousShmemResize not
+	 * reinitializing anything since NSharedBuffers is already broadcasted.
+	 */
+	BarrierAttach(barrier);
+
+	/* First phase means the resize has begun, SHMEM_RESIZE_START */
+	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);
+
+	/* XXX: Split mremap and buffer reinitialization into two barrier phases */
+	AnonymousShmemResize();
+
+	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
+	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
+
+	/* Allow the last backend to reset the barrier */
+	if (BarrierArriveAndDetach(barrier))
+		ResetShmemBarrier();
+
+	return true;
+}
+
+/*
+ * GUC assign hook for shared_buffers. It's recommended for an assign hook to
+ * be as minimal as possible, thus we just request shared memory resize and
+ * remember the previous value.
+ */
+void
+assign_shared_buffers(int newval, void *extra, bool *pending)
+{
+	elog(DEBUG1, "Received SIGHUP for shmem resizing");
+
+	/* Request shared memory resize only when it was initialized */
+	if (next_free_segment != 0)
+	{
+		elog(DEBUG1, "Set pending signal");
+		pending_pm_shmem_resize = true;
+		*pending = true;
+		NBuffersPending = newval;
+	}
+
+	NBuffersOld = NBuffers;
+}
+
+/*
+ * Test if we have somehow missed a shmem resize signal and the NBuffers value
+ * differs from NSharedBuffers. If so, catch up and do the resize.
+ */
+void
+AdjustShmemSize(void)
+{
+	uint32 NSharedBuffers = pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);
+
+	if (NSharedBuffers != NBuffers)
+	{
+		/*
+		 * If the broadcasted shared_buffers is different from the one we see,
+		 * it could be that the backend has missed a resize signal. To avoid
+		 * any inconsistency, adjust the shared mappings, before having a
+		 * chance to access the buffer pool.
+		 */
+		ereport(LOG,
+				(errmsg("shared_buffers has been changed from %d to %d, "
+						"resize shared memory",
+						NBuffers, NSharedBuffers)));
+		NBuffers = NSharedBuffers;
+		AnonymousShmemResize();
+	}
+}
+
+/*
+ * Coordinate all existing processes to make sure they all have a consistent
+ * view of the shared memory size. Must be called only in the postmaster.
+ */
+void
+CoordinateShmemResize(void)
+{
+	elog(DEBUG1, "Coordinating shmem resize from %d to %d",
+		 NBuffersOld, NBuffers);
+	Assert(!IsUnderPostmaster);
+
+	/*
+	 * If the value did not change, or shared memory segments are not
+	 * initialized yet, skip the resize.
+	 */
+	if (NBuffersPending == NBuffersOld || next_free_segment == 0)
+	{
+		elog(DEBUG1, "Skip resizing, new %d, old %d, free segment %d",
+			 NBuffers, NBuffersOld, next_free_segment);
+		return;
+	}
+
+	/*
+	 * Shared memory resize requires some coordination done by postmaster,
+	 * and consists of three phases:
+	 *
+	 * - Before the resize all existing backends have the same old NBuffers.
+	 * - While the resize is in progress, backends are expected to have a
+	 *   mixture of old and new values. They're not allowed to touch the
+	 *   buffer pool during this time frame.
+	 * - After the resize has finished, all existing backends that can access
+	 *   the buffer pool are expected to have the same new value of NBuffers.
+	 *   There might still be some backends that are sleeping or for some
+	 *   other reason not doing any work yet and have the old NBuffers -- but
+	 *   as soon as they get a time slice, they will pick up the new
+	 *   value.
+	 */
+	elog(DEBUG1, "Emit a barrier for shmem resizing");
+	EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
+
+	AnonymousShmemResize();
+
+	/*
+	 * Normally we would call WaitForProcSignalBarrier here to wait until every
+	 * backend has reported on the ProcSignalBarrier. But for shared memory
+	 * resize we don't need this, as every participating backend will
+	 * synchronize on the ProcSignal barrier, and there is no sequential logic
+	 * we have to perform afterwards. In fact even if we would like to wait
+	 * here, it wouldn't be possible -- we're in the postmaster, without any
+	 * waiting infrastructure available.
+	 *
+	 * If at some point it will turn out that waiting is essential, we would
+	 * need to consider some alternatives. E.g. it could be a designated
+	 * coordination process, which is not a postmaster. Another option would be
+	 * to introduce a CoordinateShmemResize lock and allow only one process to
+	 * take it (this probably would have to be something different than
+	 * LWLocks, since they block interrupts, and coordination relies on them).
+	 */
+}
+
 /*
  * PGSharedMemoryCreate
  *
@@ -1174,3 +1453,24 @@ PGSharedMemoryDetach(void)
 		}
 	}
 }
+
+void
+WaitOnShmemBarrier(int phase)
+{
+	Barrier *barrier = &ShmemCtrl->Barrier;
+
+	if (BarrierPhase(barrier) == phase)
+	{
+		ereport(LOG,
+				(errmsg("ProcSignal barrier is in phase %d, waiting", phase)));
+		BarrierAttach(barrier);
+		BarrierArriveAndWait(barrier, 0);
+		BarrierDetach(barrier);
+	}
+}
+
+void
+ResetShmemBarrier(void)
+{
+	BarrierInit(&ShmemCtrl->Barrier, 0);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index bb22b13adef..f3e508141b2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -418,6 +418,7 @@ static void process_pm_pmsignal(void);
 static void process_pm_child_exit(void);
 static void process_pm_reload_request(void);
 static void process_pm_shutdown_request(void);
+static void process_pm_shmem_resize(void);
 static void dummy_handler(SIGNAL_ARGS);
 static void CleanupBackend(PMChild *bp, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
@@ -1680,6 +1681,9 @@ ServerLoop(void)
 			if (pending_pm_pmsignal)
 				process_pm_pmsignal();
 
+			if (pending_pm_shmem_resize)
+				process_pm_shmem_resize();
+
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
 				ClientSocket s;
@@ -2026,6 +2030,17 @@ process_pm_reload_request(void)
 	}
 }
 
+static void
+process_pm_shmem_resize(void)
+{
+	/*
+	 * Failure to resize is considered to be fatal and will not be
+	 * retried, which means we can disable pending flag right here.
+	 */
+	pending_pm_shmem_resize = false;
+	CoordinateShmemResize();
+}
+
 /*
  * pg_ctl uses SIGTERM, SIGINT and SIGQUIT to request different types of
  * shutdown.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index f5b9290a640..b7de0ab6b0d 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -23,6 +23,41 @@ ConditionVariableMinimallyPadded *BufferIOCVArray;
 WritebackContext BackendWritebackContext;
 CkptSortItem *CkptBufferIds;
 
+/*
+ * Currently broadcasted value of NBuffers in shared memory.
+ *
+ * Most of the time this value is going to be equal to NBuffers. But if
+ * postmaster is resizing shared memory and a new backend was created
+ * at the same time, there is a possibility for the new backend to inherit the
+ * old NBuffers value, but miss the resize signal if ProcSignal infrastructure
+ * was not initialized yet. Consider this situation:
+ *
+ *     Postmaster ------> New Backend
+ *         |                   |
+ *         |                Launch
+ *         |                   |
+ *         |             Inherit NBuffers
+ *         |                   |
+ *     Resize NBuffers         |
+ *         |                   |
+ *     Emit Barrier            |
+ *         |            Init ProcSignal
+ *         |                   |
+ *     Finish resize           |
+ *         |                   |
+ *     New NBuffers       Old NBuffers
+ *
+ * In this case the backend is not yet ready to receive a signal from
+ * EmitProcSignalBarrier, and the signal will be ignored. The same happens if ProcSignal
+ * is initialized even later, after the resizing was finished.
+ *
+ * To address resulting inconsistency, postmaster broadcasts the current
+ * NBuffers value via shared memory. Every new backend has to verify this value
+ * before it accesses the buffer pool: if it differs from its own value,
+ * this indicates a shared memory resize has happened and the backend has to
+ * first synchronize with the rest of the pack.
+ */
+ShmemControl *ShmemCtrl = NULL;
 
 /*
  * Data Structures:
@@ -72,7 +107,19 @@ BufferManagerShmemInit(void)
 	bool		foundBufs,
 				foundDescs,
 				foundIOCV,
-				foundBufCkpt;
+				foundBufCkpt,
+				foundShmemCtrl;
+
+	ShmemCtrl = (ShmemControl *)
+		ShmemInitStruct("Shmem Control", sizeof(ShmemControl),
+						&foundShmemCtrl);
+
+	if (!foundShmemCtrl)
+	{
+		/* Initialize with the currently known value */
+		pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+		BarrierInit(&ShmemCtrl->Barrier, 0);
+	}
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
@@ -153,6 +200,109 @@ BufferManagerShmemInit(void)
 						 &backend_flush_after);
 }
 
+/*
+ * Reinitialize shared memory structures whose size depends on NBuffers. It's
+ * similar to BufferManagerShmemInit, but applied only to the buffers in the
+ * range between NBuffersOld and NBuffers.
+ *
+ * NBuffersOld tells what the original value of NBuffers was. It will be
+ * used to identify new and not yet initialized buffers.
+ *
+ * The initNew flag indicates that the caller wants new buffers to be
+ * initialized. No locks are taken in this function; it is the caller's
+ * responsibility to make sure only one backend can work with new buffers.
+ */
+void
+ResizeBufferPool(int NBuffersOld, bool initNew)
+{
+	bool		foundBufs,
+				foundDescs,
+				foundIOCV,
+				foundBufCkpt;
+	int			i;
+	elog(DEBUG1, "Resizing buffer pool from %d to %d", NBuffersOld, NBuffers);
+
+	/* XXX: Only increasing of shared_buffers is supported in this function */
+	if(NBuffersOld > NBuffers)
+		return;
+
+	/* Align descriptors to a cacheline boundary. */
+	BufferDescriptors = (BufferDescPadded *)
+		ShmemInitStructInSegment("Buffer Descriptors",
+						NBuffers * sizeof(BufferDescPadded),
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
+
+	/* Align condition variables to cacheline boundary. */
+	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
+						NBuffers * sizeof(ConditionVariableMinimallyPadded),
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
+
+	/*
+	 * The array used to sort to-be-checkpointed buffer ids is located in
+	 * shared memory, to avoid having to allocate significant amounts of
+	 * memory at runtime. As that'd be in the middle of a checkpoint, or when
+	 * the checkpointer is restarted, memory allocation failures would be
+	 * painful.
+	 */
+	CkptBufferIds = (CkptSortItem *)
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
+
+	/* Align buffer pool on IO page size boundary. */
+	BufferBlocks = (char *)
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemInitStructInSegment("Buffer Blocks",
+								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
+
+	/*
+	 * It's enough to only resize the shmem structures if some other backend
+	 * will initialize the new buffers for us.
+	 */
+	if (!initNew)
+		return;
+
+	elog(DEBUG1, "Initialize new buffers");
+
+	/*
+	 * Initialize the headers for new buffers.
+	 */
+	for (i = NBuffersOld; i < NBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
+
+		ClearBufferTag(&buf->tag);
+
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+
+		buf->buf_id = i;
+
+		/*
+		 * Initially link all the buffers together as unused. Subsequent
+		 * management of this list is done by freelist.c.
+		 */
+		buf->freeNext = i + 1;
+
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
+
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
+	}
+
+	/* Correct last entry of linked list */
+	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+
+	/* Init other shared buffer-management stuff */
+	StrategyInitialize(!foundDescs);
+
+	/* Initialize per-backend file flush context */
+	WritebackContextInit(&BackendWritebackContext,
+						 &backend_flush_after);
+}
+
 /*
  * BufferManagerShmemSize
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 68778522591..a2c635f288e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -83,6 +83,9 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ *
+ * XXX: Calculations for non-main shared memory segments are incorrect; they
+ * include more than is needed for buffers alone.
  */
 Size
 CalculateShmemSize(int *num_semaphores, int shmem_segment)
@@ -149,6 +152,14 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how the slots are split, or was there a hidden
+	 * inconsistency in the shmem calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7401b6e625e..bec0e00f901 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/pg_shmem.h"
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
@@ -108,6 +109,10 @@ static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
 
+#ifdef DEBUG_SHMEM_RESIZE
+bool delay_proc_signal_init = false;
+#endif
+
 /*
  * ProcSignalShmemSize
  *		Compute space needed for ProcSignal's shared memory
@@ -168,6 +173,42 @@ ProcSignalInit(bool cancel_key_valid, int32 cancel_key)
 	ProcSignalSlot *slot;
 	uint64		barrier_generation;
 
+#ifdef DEBUG_SHMEM_RESIZE
+	/*
+	 * Introduced for debugging purposes. You can change the variable at
+	 * runtime using gdb, then start new backends with delayed ProcSignal
+	 * initialization. A simple pg_usleep won't work here due to the SIGHUP
+	 * interrupt needed for testing. Taken from pg_sleep.
+	 */
+	if (delay_proc_signal_init)
+	{
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+		float8		endtime = GetNowFloat() + 5;
+
+		for (;;)
+		{
+			float8		delay;
+			long		delay_ms;
+
+			CHECK_FOR_INTERRUPTS();
+
+			delay = endtime - GetNowFloat();
+			if (delay >= 600.0)
+				delay_ms = 600000;
+			else if (delay > 0.0)
+				delay_ms = (long) (delay * 1000.0);
+			else
+				break;
+
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 delay_ms,
+							 WAIT_EVENT_PG_SLEEP);
+			ResetLatch(MyLatch);
+		}
+	}
+#endif
+
 	if (MyProcNumber < 0)
 		elog(ERROR, "MyProcNumber not set");
 	if (MyProcNumber >= NumProcSignalSlots)
@@ -573,6 +614,10 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
+					case PROCSIGNAL_BARRIER_SHMEM_RESIZE:
+						processed = ProcessBarrierShmemResize(
+								&ShmemCtrl->Barrier);
+						break;
 				}
 
 				/*
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 389abc82519..226b38ba979 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -493,17 +493,13 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. Verify the structure's size:
+		 * - If it's the same, we've found the expected structure.
+		 * - If it's different, we're resizing the expected structure.
 		 */
 		if (result->size != size)
-		{
-			LWLockRelease(ShmemIndexLock);
-			ereport(ERROR,
-					(errmsg("ShmemIndex entry size is wrong for data structure"
-							" \"%s\": expected %zu, actual %zu",
-							name, size, result->size)));
-		}
+			result->size = size;
+
 		structPtr = result->location;
 	}
 	else
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 13fb8c31702..04cdd0d24d8 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -62,6 +62,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -4267,6 +4268,20 @@ PostgresMain(const char *dbname, const char *username)
 	 */
 	BeginReportingGUCOptions();
 
+	/*
+	 * Verify the shared barrier, if it's still active: join and wait.
+	 *
+	 * XXX: Any potential race condition if not a single backend has
+	 * incremented the barrier phase?
+	 */
+	WaitOnShmemBarrier(SHMEM_RESIZE_START);
+
+	/*
+	 * After waiting on the barrier above we are guaranteed to have NSharedBuffers
+	 * broadcasted, so we can use it in the function below.
+	 */
+	AdjustShmemSize();
+
 	/*
 	 * Also set up handler to log session end; we have to wait till now to be
 	 * sure Log_disconnections has its final value.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index ccf73781d81..947f13cb1fa 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -154,6 +154,8 @@ REPLICATION_ORIGIN_DROP	"Waiting for a replication origin to become inactive so
 REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
+SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
+SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
@@ -346,6 +348,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 42728189322..01faf705582 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2329,14 +2329,14 @@ struct config_int ConfigureNamesInt[] =
 	 * checking for overflow, so we mustn't allow more than INT_MAX / 2.
 	 */
 	{
-		{"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+		{"shared_buffers", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the number of shared memory buffers used by the server."),
 			NULL,
 			GUC_UNIT_BLOCKS
 		},
 		&NBuffers,
 		16384, 16, INT_MAX / 2,
-		NULL, NULL, NULL
+		NULL, assign_shared_buffers, NULL
 	},
 
 	{
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index bb7fe02e243..fff80214822 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -298,6 +298,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
 extern Size BufferManagerShmemSize(int);
+extern void ResizeBufferPool(int, bool);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index c0439f2206b..5f5b45c88bd 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -64,6 +64,7 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
+extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
 
 extern void proc_exit(int code) pg_attribute_noreturn();
 extern void shmem_exit(int code);
@@ -83,5 +84,6 @@ extern void CreateSharedMemoryAndSemaphores(void);
 extern void AttachSharedMemoryStructs(void);
 #endif
 extern void InitializeShmemGUCs(void);
+extern void CoordinateShmemResize(void);
 
 #endif							/* IPC_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index ff897515769..e5d8cd183cf 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ShmemResize)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index ba0192baf95..b597df0d3a3 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,6 +24,7 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "storage/barrier.h"
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
@@ -56,6 +57,23 @@ typedef struct ShmemSegment
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
+/*
+ * ShmemControl is shared between backends and helps to coordinate shared
+ * memory resize.
+ */
+typedef struct
+{
+	pg_atomic_uint32 	NSharedBuffers;
+	Barrier 			Barrier;
+} ShmemControl;
+
+extern PGDLLIMPORT ShmemControl *ShmemCtrl;
+
+/* The phases for shared memory resizing, used by for ProcSignal barrier. */
+#define SHMEM_RESIZE_REQUESTED			0
+#define SHMEM_RESIZE_START				1
+#define SHMEM_RESIZE_DONE				2
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -105,6 +123,12 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+bool ProcessBarrierShmemResize(Barrier *barrier);
+void assign_shared_buffers(int newval, void *extra, bool *pending);
+void AdjustShmemSize(void);
+extern void WaitOnShmemBarrier(int phase);
+extern void ResetShmemBarrier(void);
+
 /*
 * To be able to dynamically resize the largest parts of the data stored in
 * shared memory, we split it into multiple shared memory segments. Each
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 022fd8ed933..4c9973dc2d9 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,6 +54,7 @@ typedef enum
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
+	PROCSIGNAL_BARRIER_SHMEM_RESIZE, /* ask backends to resize shared memory */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b6c170ac249..00f6b9d7d7d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2668,6 +2668,7 @@ ShellTypeInfo
 ShippableCacheEntry
 ShippableCacheKey
 ShmemIndexEnt
+ShmemControl
 ShutdownForeignScan_function
 ShutdownInformation
 ShutdownMode
-- 
2.34.1

0002-Allow-placing-shared-memory-mapping-with-an-20250228.patch (text/x-patch)
From 436475da9bbb9f8f370332931016f4a5cbd9d393 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:21:33 +0200
Subject: [PATCH 02/11] Allow placing shared memory mapping with an offset

Currently the kernel is responsible for choosing an address at which to place
each shared memory mapping, which is the lowest possible address that does
not clash with any other mappings. This is considered to be the most portable
approach, but one of the downsides is that there is no room left to resize
the allocated mappings. Here is how it looks for one mapping in
/proc/$PID/maps; /dev/zero represents the anonymous shared memory in question:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
    ...
    7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
    7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)

By specifying the mapping address directly it's possible to place the
mapping in a way that leaves room for resizing. The idea is first to get
the address chosen by the kernel, then apply some offset derived from
the expected upper limit. Because we base the layout on the address
chosen by the kernel, things like address space randomization should not
be a problem, since the randomization is applied to the mmap base, which
is one per process. The result looks like this:

    012d9000-0133e000         [heap]
    7f443a800000-7f444196c000 /dev/zero (deleted)
    [...free space...]
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
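
For illustration only, the same probe-and-offset placement can be reproduced
with a small standalone Linux program (sizes are arbitrary, error handling is
minimal, and nothing here is taken from the patch itself):

    #define _GNU_SOURCE             /* MAP_FIXED_NOREPLACE, mremap */
    #include <stdio.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        size_t  probe_size = 128 * 1024 * 1024;     /* throwaway probe */
        size_t  init_size  = 16 * 1024 * 1024;      /* initial segment */
        size_t  reserve    = 1024UL * 1024 * 1024;  /* room to grow into */
        char   *probe, *seg;

        /* Let the kernel pick the lowest suitable address. */
        probe = mmap(NULL, probe_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (probe == MAP_FAILED)
            return 1;
        munmap(probe, probe_size);

        /* Place the real mapping below the probe address, leaving a gap. */
        seg = mmap(probe - reserve - init_size, init_size,
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
        if (seg == MAP_FAILED)
            return 1;

        /* Later, grow in place into the reserved gap above the mapping. */
        if (mremap(seg, init_size, init_size + reserve, 0) == MAP_FAILED)
            return 1;

        printf("segment at %p grown to %zu bytes\n",
               (void *) seg, init_size + reserve);
        return 0;
    }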

This approach does not impact the actual memory usage as reported by the kernel.
Here is the output of /proc/$PID/status for the master version with
shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages mapped in mm_struct
    VmPeak:           422780 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             21248 kB
    // Size of resident anonymous memory
    RssAnon:             640 kB
    // Size of resident file mappings
    RssFile:            9728 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          10880 kB

Here is the same for the patch with the shared mapping placed at
an offset 10 GB:

    VmPeak:          1102844 kB
    VmRSS:             21376 kB
    RssAnon:             640 kB
    RssFile:            9856 kB
    RssShmem:          10880 kB

Cgroup v2 doesn't have any problems with that either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    18219008 (~17 MB)

Note that currently the implementation makes assumptions about the upper limit.
Ideally it should be based on the maximum available memory.
---
 src/backend/port/sysv_shmem.c | 120 +++++++++++++++++++++++++++++++++-
 1 file changed, 119 insertions(+), 1 deletion(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 843b1b3220f..62f01d8218a 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -108,6 +108,63 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
+/*
+ * Anonymous mapping placing (/dev/zero (deleted) below) looks like this:
+ *
+ * 00400000-00490000         /path/bin/postgres
+ * ...
+ * 012d9000-0133e000         [heap]
+ * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ * 7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
+ * 7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)
+ * ...
+ *
+ * We would like to place multiple mappings in such a way that there will be
+ * enough space between them in the address space to be able to resize up to
+ * a certain size, but without counting towards the total memory consumption.
+ *
+ * By letting Linux choose a mapping address, it will pick the lowest
+ * possible address that does not clash with any other mappings, which will
+ * be right before the locales in the example above. This information
+ * (maximum allowed size of mappings and the lowest mapping address) is
+ * enough to place every mapping as follows:
+ *
+ * - Take the lowest mapping address, which we later call the probe address.
+ * - Subtract the offset of the previous mapping.
+ * - Subtract the maximum allowed size for the current mapping from the
+ *   address.
+ * - Place the mapping at the resulting address.
+ *
+ * The result would look like this:
+ *
+ * 012d9000-0133e000         [heap]
+ * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * [...free space...]
+ * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * [...free space...]
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ */
+Size SHMEM_EXTRA_SIZE_LIMIT[1] = {
+	0, 									/* MAIN_SHMEM_SLOT */
+};
+
+/* Remembers offset of the last mapping from the probe address */
+static Size last_offset = 0;
+
+/*
+ * Size of the mapping, which will be used to calculate the anonymous mapping
+ * address. It should not be too small, otherwise there is a chance the probe
+ * mapping will be created between other mappings, leaving no room for extending
+ * it. But it should not be too large either, in case there are limitations
+ * on the mapping size. The current value is the default shared_buffers size.
+ */
+#define PROBE_MAPPING_SIZE (Size) 128 * 1024 * 1024
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -673,13 +730,74 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
 	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
 	{
+		void *probe = NULL;
+
 		/*
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
 		allocsize = mapping->shmem_size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+
+		/*
+		 * Try to create the mapping at an address that will allow extending
+		 * it later:
+		 *
+		 * - First create a temporary probe mapping of a fixed size and let
+		 *   the kernel place it at an address of its choice. By virtue of the
+		 *   probe mapping size we expect it to be located at the lowest
+		 *   possible address, with some unmapped space above.
+		 *
+		 * - Unmap the probe mapping, remembering the address.
+		 *
+		 * - Create the actual anonymous mapping at that address with the
+		 *   offset. The offset is calculated in such a way as to allow
+		 *   growing the mapping within certain boundaries. For this mapping
+		 *   we use MAP_FIXED_NOREPLACE, which will error out with EEXIST if
+		 *   there is any mapping clash.
+		 *
+		 * - If the last step has failed, fall back to the regular mapping
+		 *   creation and signal that shared buffers cannot be resized
+		 *   without a restart.
+		 */
+		probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
+
+		if (probe == MAP_FAILED)
+		{
+			mmap_errno = errno;
+			DebugMappings();
+			elog(DEBUG1, "segment[%s]: probe mmap(%zu) failed: %m",
+					MappingName(mapping->shmem_segment), allocsize);
+		}
+		else
+		{
+			Size offset = last_offset + SHMEM_EXTRA_SIZE_LIMIT[next_free_segment] + allocsize;
+			last_offset = offset;
+
+			munmap(probe, PROBE_MAPPING_SIZE);
+
+			ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE,
+					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+			mmap_errno = errno;
+			if (ptr == MAP_FAILED)
+			{
+				DebugMappings();
+				elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p failed: %m",
+					 MappingName(mapping->shmem_segment), allocsize, probe - offset);
+			}
+
+		}
+	}
+
+	if (ptr == MAP_FAILED)
+	{
+		/*
+		 * Fallback to the portable way of creating a mapping.
+		 */
+		allocsize = mapping->shmem_size;
+
+		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+						   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
 	}
 
-- 
2.34.1

0001-Allow-to-use-multiple-shared-memory-mapping-20250228.patch (text/x-patch)
From 4fb360f80ddb353d013d7b8e999b88522a36a604 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 19 Feb 2025 17:43:13 +0100
Subject: [PATCH 01/11] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways shared memory can be organized.

Introduce the possibility to allocate multiple shared memory mappings, where
a single mapping is associated with a specified shared memory segment.
There is only a fixed number of available segments; currently only the
main shared memory segment is allocated. A new shared memory API is
introduced, extended with a segment as a new parameter. As the path of
least resistance, the original API is kept in place, using the main
shared memory segment.
---
 src/backend/port/posix_sema.c       |   4 +-
 src/backend/port/sysv_sema.c        |   4 +-
 src/backend/port/sysv_shmem.c       | 138 ++++++++++++++++++---------
 src/backend/port/win32_sema.c       |   2 +-
 src/backend/storage/ipc/ipc.c       |   4 +-
 src/backend/storage/ipc/ipci.c      |  63 +++++++------
 src/backend/storage/ipc/shmem.c     | 141 +++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c   |   5 +-
 src/include/storage/buf_internals.h |   1 +
 src/include/storage/ipc.h           |   2 +-
 src/include/storage/pg_sema.h       |   2 +-
 src/include/storage/pg_shmem.h      |  18 ++++
 src/include/storage/shmem.h         |  12 +++
 13 files changed, 272 insertions(+), 124 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..401e1113fa1 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 #endif
 
 	numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index f7c8638aec5..b6301463ac7 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -313,7 +313,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -334,7 +334,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 	numSharedSemas = 0;
 	maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..843b1b3220f 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+	int shmem_segment;
+	Size shmem_size; 			/* Size of the mapping */
+	void *shmem; 				/* Pointer to the start of the mapped memory */
+	void *seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping segments */
+static int next_free_segment = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_segment)
+{
+	switch (shmem_segment)
+	{
+		case MAIN_SHMEM_SEGMENT:
+			return "main";
+		default:
+			return "unknown";
+	}
+}
+
+static void
+DebugMappings()
+{
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		AnonymousMapping m = Mappings[i];
+		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+			 MappingName(i), m.shmem, m.shmem_size);
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-	Size		allocsize = *size;
+	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
 				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		{
+			DebugMappings();
+			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
+				 MappingName(mapping->shmem_segment), allocsize);
+		}
 	}
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = *size;
+		allocsize = mapping->shmem_size;
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
 	if (ptr == MAP_FAILED)
 	{
 		errno = mmap_errno;
+		DebugMappings();
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
-	*size = allocsize;
-	return ptr;
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		AnonymousMapping m = Mappings[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	mapping->shmem_size = size;
+	mapping->shmem_segment = next_free_segment;
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping);
+
+		next_free_segment++;
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + next_free_segment;
 
 	for (;;)
 	{
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	mapping->seg_addr = memAddress;
+	mapping->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (mapping->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		AnonymousMapping m = Mappings[i];
+
+		if (m.seg_addr != NULL)
+		{
+			if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+			m.seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index 5854ad1f54d..e7365ff8060 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index e4d5b944e12..9d526eb43fd 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -61,6 +61,8 @@ static void proc_exit_prepare(int code);
  * but provide some additional features we need --- in particular,
  * we want to register callbacks to invoke when we are disconnecting
  * from a broken shared-memory context but not exiting the postmaster.
+ * The maximum number of such exit callbacks depends on the number of shared
+ * memory segments.
  *
  * Callback functions can take zero, one, or two args: the first passed
  * arg is the integer exitcode, the second is the Datum supplied when
@@ -68,7 +70,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..4f6c707c204 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -85,7 +85,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_segment)
 {
 	Size		size;
 	int			numSemas;
@@ -204,33 +204,38 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize(&numSemas);
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
-
-	/*
-	 * Create semaphores
-	 */
-	PGReserveSemaphores(numSemas);
-
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		/* Compute the size of the shared-memory block */
+		size = CalculateShmemSize(&numSemas, segment);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+		/*
+		 * Create the shmem segment.
+		 *
+		 * XXX: Are multiple shims needed, one per segment?
+		 */
+		seghdr = PGSharedMemoryCreate(size, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSegment(seghdr, segment);
+
+		/*
+		 * Create semaphores
+		 */
+		PGReserveSemaphores(numSemas, segment);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSegment(segment);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -360,7 +365,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas);
+	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..389abc82519 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -75,19 +75,19 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
+								 int shmem_segment);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
+ShmemSegment Segments[ANON_MAPPINGS];
 
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/*
+ * Primary index hashtable for shmem. For simplicity we use a single index for
+ * all shared memory segments. This may have performance consequences; an
+ * alternative would be to have one index per shared memory segment.
+ */
+static HTAB *ShmemIndex = NULL;
 
 
 /*
@@ -96,9 +96,17 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-	ShmemSegHdr = seghdr;
-	ShmemBase = seghdr;
-	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+	InitShmemAccessInSegment(seghdr, MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAccessInSegment(PGShmemHeader *seghdr, int shmem_segment)
+{
+	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	ShmemSegment *seg = &Segments[shmem_segment];
+	seg->ShmemSegHdr = shmhdr;
+	seg->ShmemBase = (void *) shmhdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -109,7 +117,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSegment(MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAllocationInSegment(int shmem_segment)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_segment].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -118,9 +132,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_segment].ShmemLock = (slock_t *) ShmemAllocUnlockedInSegment(sizeof(slock_t), shmem_segment);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_segment].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -145,11 +159,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocInSegment(Size size, int shmem_segment)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -179,6 +199,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSegment(size, allocated_size, MAIN_SHMEM_SEGMENT);
+}
+
+static void *
+ShmemAllocRawInSegment(Size size, Size *allocated_size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -198,22 +224,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= Segments[shmem_segment].ShmemSegHdr->totalsize)
 	{
-		newSpace = (char *) ShmemBase + newStart;
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
+		Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -231,6 +257,12 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+	return ShmemAllocUnlockedInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -241,19 +273,19 @@ ShmemAllocUnlocked(Size size)
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory (%zu bytes requested)",
 						size)));
-	ShmemSegHdr->freeoffset = newFree;
+	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (char *) ShmemBase + newStart;
+	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -268,7 +300,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSegment(addr, MAIN_SHMEM_SEGMENT);
+}
+
+bool
+ShmemAddrIsValidInSegment(const void *addr, int shmem_segment)
+{
+	return (addr >= Segments[shmem_segment].ShmemBase) && (addr < Segments[shmem_segment].ShmemEnd);
 }
 
 /*
@@ -329,6 +367,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSegment(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SEGMENT);
+}
+
+HTAB *
+ShmemInitHashInSegment(const char *name,		/* table string name for shmem index */
+			  long init_size,		/* initial table size */
+			  long max_size,		/* max size of the table */
+			  HASHCTL *infoP,		/* info about key and bucket size */
+			  int hash_flags,		/* info about infoP */
+			  int shmem_segment) 	/* in which segment to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -345,9 +395,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSegment(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_segment);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -380,6 +430,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSegment(name, size, foundPtr, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
+					  int shmem_segment)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -388,7 +445,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_segment].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -411,7 +468,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSegment(size, shmem_segment);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -428,8 +485,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in segment %d",
+						name, shmem_segment)));
 	}
 
 	if (*foundPtr)
@@ -454,7 +511,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -473,14 +530,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -537,10 +593,11 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
+	/* XXX: take all shared memory segments into account. */
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
 		values[2] = Int64GetDatum(ent->size);
 		values[3] = Int64GetDatum(ent->allocated_size);
 		named_allocated += ent->allocated_size;
@@ -552,15 +609,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	/* output shared memory allocated but not counted via the shmem index */
 	values[0] = CStringGetTextDatum("<anonymous>");
 	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 	/* output as-of-yet unused shared memory */
 	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index f1e74f184f1..40aa4014b5f 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -81,6 +81,7 @@
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/spin.h"
@@ -607,9 +608,9 @@ LWLockNewTrancheId(void)
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
 	/* We use the ShmemLock spinlock to protect LWLockCounter */
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 	result = (*LWLockCounter)++;
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	return result;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 1a65342177d..4595f5a9676 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -22,6 +22,7 @@
 #include "storage/condition_variable.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
+#include "storage/pg_shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index e0f5f92e947..c0439f2206b 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index fa6ca35a51f..8ae9637fcd0 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_segment);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..138078c29c5 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+} ShmemSegment;
+
+/* Number of available segments for anonymous memory mappings */
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -90,4 +105,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main segment, contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SEGMENT 0
+
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 904a336b851..5929f140236 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -29,15 +29,27 @@
 extern PGDLLIMPORT slock_t *ShmemLock;
 struct PGShmemHeader;			/* avoid including storage/pg_shmem.h here */
 extern void InitShmemAccess(struct PGShmemHeader *seghdr);
+extern void InitShmemAccessInSegment(struct PGShmemHeader *seghdr,
+									 int shmem_segment);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSegment(int shmem_segment);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSegment(Size size, int shmem_segment);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSegment(Size size, int shmem_segment);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSegment(const void *addr, int shmem_segment);
 extern void InitShmemIndex(void);
+extern void InitVariableShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
+									long max_size, HASHCTL *infoP,
+									int hash_flags, int shmem_segment);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSegment(const char *name, Size size,
+									  bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 

base-commit: eaf502747bacee0122668eb1ba3979f86b8d8342
-- 
2.34.1
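
To make the shape of the change above easier to follow, here is a toy,
self-contained sketch (illustration only, heavily simplified, not the actual
PostgreSQL code) of the pattern patch 0001 uses: an array of segment
descriptors, a bump allocator per segment, and the legacy single-segment API
kept as a thin wrapper over the main segment.

#include <stdio.h>
#include <stdlib.h>

#define NUM_SEGMENTS	1
#define MAIN_SEGMENT	0

typedef struct Segment
{
	char	   *base;			/* start of the segment's memory */
	size_t		total;			/* total size of the segment */
	size_t		freeoffset;		/* offset of the next free byte */
} Segment;

static Segment segments[NUM_SEGMENTS];

static void
segment_init(int seg, size_t size)
{
	segments[seg].base = malloc(size);	/* stand-in for the mmap'ed region */
	segments[seg].total = size;
	segments[seg].freeoffset = 0;
}

/* Segment-aware allocation, roughly analogous to ShmemAllocInSegment() */
static void *
alloc_in_segment(size_t size, int seg)
{
	Segment    *s = &segments[seg];
	void	   *result;

	if (s->freeoffset + size > s->total)
		return NULL;			/* out of "shared" memory */
	result = s->base + s->freeoffset;
	s->freeoffset += size;
	return result;
}

/* Legacy API kept in place, delegating to the main segment */
static void *
alloc_legacy(size_t size)
{
	return alloc_in_segment(size, MAIN_SEGMENT);
}

int
main(void)
{
	segment_init(MAIN_SEGMENT, 1024);
	printf("legacy alloc:  %p\n", alloc_legacy(128));
	printf("segment alloc: %p\n", alloc_in_segment(256, MAIN_SEGMENT));
	return 0;
}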

0006-Use-anonymous-files-to-back-shared-memory-s-20250228.patch (text/x-patch)
From be911372e4de4b2b98699512007ff8055dbea2f2 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 23 Feb 2025 14:42:39 +0100
Subject: [PATCH 06/11] Use anonymous files to back shared memory segments

Allow the use of anonymous files for shared memory, instead of plain
anonymous memory. Such an anonymous file is created via memfd_create; it
lives in memory, behaves like a regular file, and is semantically
equivalent to anonymous memory allocated via mmap with MAP_ANONYMOUS.

The advantages of using anonymous files are the following:

* We get a file descriptor, which can be used for regular file
  operations (modification, truncation, you name it).

* The file can be given a name, which improves readability of process
  maps. Here is how it looks:

7f5a2bd04000-7f5a32e52000 rw-s 00000000 00:01 1845 /memfd:strategy (deleted)
7f5a39252000-7f5a4030e000 rw-s 00000000 00:01 1842 /memfd:checkpoint (deleted)
7f5a4670e000-7f5a4d7ba000 rw-s 00000000 00:01 1839 /memfd:iocv (deleted)
7f5a53bba000-7f5a5ad26000 rw-s 00000000 00:01 1836 /memfd:descriptors (deleted)
7f5a9ad26000-7f5aa9d94000 rw-s 00000000 00:01 1833 /memfd:buffers (deleted)
7f5d29d94000-7f5d30e00000 rw-s 00000000 00:01 1830 /memfd:main (deleted)

* By default, Linux will not include file-backed shared mappings in a core
  dump, making it more convenient to work with them in PostgreSQL: no more
  huge dumps to process.

The downside is that memfd_create is Linux specific.
---
 src/backend/port/sysv_shmem.c | 46 +++++++++++++++++++++++++++++------
 src/include/portability/mem.h |  2 +-
 2 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 35a8ff92175..8864866f26c 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -105,6 +105,7 @@ typedef struct AnonymousMapping
 	void *shmem; 				/* Pointer to the start of the mapped memory */
 	void *seg_addr; 			/* SysV shared memory for the header */
 	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -125,7 +126,7 @@ static int next_free_segment = 0;
  * 00400000-00490000         /path/bin/postgres
  * ...
  * 012d9000-0133e000         [heap]
- * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f443a800000-7f470a800000 /memfd:main (deleted)
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
@@ -152,9 +153,9 @@ static int next_free_segment = 0;
  * The result would look like this:
  *
  * 012d9000-0133e000         [heap]
- * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f4426f54000-7f442e010000 /memfd:main (deleted)
  * [...free space...]
- * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f443a800000-7f444196c000 /memfd:buffers (deleted)
  * [...free space...]
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
@@ -717,6 +718,18 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
+	/*
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped, it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, so it has
+	 * the same semantics as anonymous memory allocated via mmap with the
+	 * MAP_ANONYMOUS flag.
+	 */
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_segment), 0);
+
 #ifndef MAP_HUGETLB
 	/* PGSharedMemoryCreate should have dealt with this case */
 	Assert(huge_pages != HUGE_PAGES_ON);
@@ -734,8 +747,13 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
+		/*
+		 * Do not use an anonymous file here yet. When adding it, do not forget
+		 * to use ftruncate and flags MFD_HUGETLB & MFD_HUGE_2MB/MFD_HUGE_1GB
+		 * in memfd_create.
+		 */
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
 		{
@@ -771,7 +789,8 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 * - First create the temporary probe mapping of a fixed size and let
 		 *   kernel to place it at address of its choice. By the virtue of the
 		 *   probe mapping size we expect it to be located at the lowest
-		 *   possible address, expecting some non mapped space above.
+		 *   possible address, expecting some non mapped space above. The probe
+		 *   does not need to be backed by an anonymous file.
 		 *
 		 * - Unmap the probe mapping, remember the address.
 		 *
@@ -786,7 +805,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 *   without a restart.
 		 */
 		probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS, -1, 0);
 
 		if (probe == MAP_FAILED)
 		{
@@ -802,8 +821,14 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
 			munmap(probe, PROBE_MAPPING_SIZE);
 
+			/*
+			 * Specify the segment file size using allocsize, which contains
+			 * potentially modified size.
+			 */
+			ftruncate(mapping->segment_fd, allocsize);
+
 			ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE,
-					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, mapping->segment_fd, 0);
 			mmap_errno = errno;
 			if (ptr == MAP_FAILED)
 			{
@@ -822,8 +847,11 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 */
 		allocsize = mapping->shmem_size;
 
+		/* Specify the segment file size using allocsize. */
+		ftruncate(mapping->segment_fd, allocsize);
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-						   PG_MMAP_FLAGS, -1, 0);
+						   PG_MMAP_FLAGS, mapping->segment_fd, 0);
 		mmap_errno = errno;
 	}
 
@@ -917,6 +945,8 @@ AnonymousShmemResize(void)
 		if (m->shmem_size == new_size)
 			continue;
 
+		/* Resize the backing anon file. */
+		ftruncate(m->segment_fd, new_size);
 
 		/*
 		 * Fail hard if faced any issues. In theory we could try to handle this
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index ef9800732d9..40588ff6968 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
-- 
2.34.1

0010-WIP-Reinitialize-buffer-sync-strategy-20250228.patch (text/x-patch)
From 67ab7fe07bd0bd9f6a1f7ebbeaa7f27e32cf0cf3 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 27 Jan 2025 15:46:23 +0530
Subject: [PATCH 10/11] WIP: Reinitialize buffer sync strategy

Resizing the shared buffers renders the saved state of BgBufferSync() invalid,
so it has to be reinitialized. The state is kept in static variables inside the
function and thus cannot be accessed from outside the function. Hence we add an
argument to BgBufferSync() to request that the function reset its state.

TODO: Ideally we should save this state in some global structure and add a
function to reset it.

TODO: StrategyInitialize() initializes the buffer lookup table but
StrategyReInitialize() doesn't. Where should we reinitialize the buffer
lookup table?

Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c         |  6 +++++
 src/backend/postmaster/bgwriter.c     |  2 +-
 src/backend/storage/buffer/buf_init.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 20 +++++++++++++++--
 src/backend/storage/buffer/freelist.c | 32 +++++++++++++++++++++++++++
 src/include/storage/buf_internals.h   |  1 +
 src/include/storage/bufmgr.h          |  2 +-
 7 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index f084a0747ff..66d6d4b4333 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1007,7 +1007,13 @@ AnonymousShmemResize(void)
 		 * backend who can not lock the LWLock conditionally won't resize the
 		 * buffers.
 		 */
+
+		if (MyBackendType == B_BG_WRITER)
+		{
+			/* If we are bgwriter wipe out the previous state and start anew. */
+			BgBufferSync(NULL, true);
 		}
+	}
 
 	return true;
 }
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 3eff5dc6f0e..3cd421b23c0 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -231,7 +231,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
 		/*
 		 * Do one cycle of dirty-buffer writing.
 		 */
-		can_hibernate = BgBufferSync(&wb_context);
+		can_hibernate = BgBufferSync(&wb_context, false);
 
 		/* Report pending statistics to the cumulative stats system */
 		pgstat_report_bgwriter();
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index d8139a899bb..6b6f6e6ae08 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -305,7 +305,7 @@ ResizeBufferPool(int NBuffersOld, bool initNew)
 	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
 	/* Init other shared buffer-management stuff */
-	StrategyInitialize(!foundDescs);
+	StrategyReInitialize();
 
 	/* Initialize per-backend file flush context */
 	WritebackContextInit(&BackendWritebackContext,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b6bec73e6b7..a7948e75e76 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3179,7 +3179,10 @@ BufferSync(int flags)
  * BgBufferSync -- Write out some dirty buffers in the pool.
  *
  * This is called periodically by the background writer process.
- *
+ * 
+ * If `reset` = true, the function discards any saved information and starts
+ * anew.
+ * 
  * Returns true if it's appropriate for the bgwriter process to go into
  * low-power hibernation mode.  (This happens if the strategy clock sweep
  * has been "lapped" and no buffer allocations have occurred recently,
@@ -3187,7 +3190,7 @@ BufferSync(int flags)
  * bgwriter_lru_maxpages to 0.)
  */
 bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(WritebackContext *wb_context, bool reset)
 {
 	/* info obtained from freelist.c */
 	int			strategy_buf_id;
@@ -3230,6 +3233,19 @@ BgBufferSync(WritebackContext *wb_context)
 	long		new_strategy_delta;
 	uint32		new_recent_alloc;
 
+	if (reset)
+	{
+		saved_info_valid = false;
+
+		/*
+		 * Return from here if we don't have a valid WritebackContext. The next
+		 * time this function is executed with a valid WritebackContext, it will
+		 * start over again.
+		 */
+		if (!wb_context)
+			return false;
+	}
+
 	/*
 	 * Find out where the freelist clock sweep currently is, and how many
 	 * buffer allocations have happened since our last call.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 45a6e768332..4bb000f36e1 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -528,6 +528,38 @@ StrategyInitialize(bool init)
 		Assert(!init);
 }
 
+/*
+ * StrategyReInitialize -- re-initialize the buffer cache replacement
+ *		strategy.
+ *
+ * To be called when resizing buffer manager and only from the coordinator.
+ */
+void
+StrategyReInitialize(void)
+{
+	bool		found;
+
+	/*
+	 * Get or create the shared strategy control block. This is mostly not
+	 * required since we are not moving the starting pointer anyway.
+	 */
+	StrategyControl = (BufferStrategyControl *)
+		ShmemInitStructInSegment("Buffer Strategy Status",
+						sizeof(BufferStrategyControl),
+						&found, STRATEGY_SHMEM_SEGMENT);
+
+	SpinLockInit(&StrategyControl->buffer_strategy_lock);
+
+	/* Initialize the clock sweep pointer */
+	pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+
+	/* Clear statistics */
+	StrategyControl->completePasses = 0;
+	pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
+
+	/* No pending notification */
+	StrategyControl->bgwprocno = -1;
+}
 
 /* ----------------------------------------------------------------
  *				Backend-private buffer ring management
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 416c405fe4e..77bcaf1b43c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -437,6 +437,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
+extern void StrategyReInitialize(void);
 extern bool have_free_buffer(void);
 
 /* buf_table.c */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fff80214822..02db81fa416 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -288,7 +288,7 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+extern bool BgBufferSync(struct WritebackContext *wb_context, bool reset);
 
 extern void LimitAdditionalPins(uint32 *additional_pins);
 extern void LimitAdditionalLocalPins(uint32 *additional_pins);
-- 
2.34.1
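
The reset flag added above follows a simple pattern: state cached in
function-local statics is declared stale and gets rebuilt on the next regular
call. A tiny standalone sketch of that pattern (illustration only, not
PostgreSQL code; the smoothing below is just a stand-in for the real
bookkeeping in BgBufferSync()):

#include <stdbool.h>
#include <stdio.h>

/*
 * Returns a smoothed "allocations per call" estimate; pass reset = true to
 * throw away the saved state, e.g. after the pool it describes was resized.
 */
static double
smoothed_alloc_rate(int allocs, bool reset)
{
	static bool saved_info_valid = false;
	static double rate = 0.0;

	if (reset)
	{
		saved_info_valid = false;
		return 0.0;
	}

	if (!saved_info_valid)
	{
		rate = allocs;			/* start over from the current observation */
		saved_info_valid = true;
	}
	else
		rate = 0.9 * rate + 0.1 * allocs;	/* exponential smoothing */

	return rate;
}

int
main(void)
{
	printf("%.2f\n", smoothed_alloc_rate(100, false));
	printf("%.2f\n", smoothed_alloc_rate(120, false));
	smoothed_alloc_rate(0, true);	/* pool resized: discard the history */
	printf("%.2f\n", smoothed_alloc_rate(50, false));
	return 0;
}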

0007-Fix-compilation-failures-in-previous-patche-20250228.patch (text/x-patch)
From b1f510627937398ceb472f09b9b28f368a1201ea Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Tue, 25 Feb 2025 17:31:05 +0530
Subject: [PATCH 07/11] Fix compilation failures in previous patches

---
 src/backend/port/sysv_shmem.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 8864866f26c..992ed849dc0 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -825,16 +825,20 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 			 * Specify the segment file size using allocsize, which contains
 			 * potentially modified size.
 			 */
-			ftruncate(mapping->segment_fd, allocsize);
+			if (ftruncate(mapping->segment_fd, allocsize) < 0)
+				ereport(FATAL,
+						(errcode(ERRCODE_SYSTEM_ERROR),
+						errmsg("could not set the size of file backing shared memory %p: %m",
+								mapping->shmem)));
 
-			ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE,
+			ptr = mmap((void *)((char *) probe - offset), allocsize, PROT_READ | PROT_WRITE,
 					   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, mapping->segment_fd, 0);
 			mmap_errno = errno;
 			if (ptr == MAP_FAILED)
 			{
 				DebugMappings();
 				elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p failed: %m",
-					 MappingName(mapping->shmem_segment), allocsize, probe - offset);
+					 MappingName(mapping->shmem_segment), allocsize, ((void *) ((char *)probe - offset)));
 			}
 
 		}
@@ -848,7 +852,11 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		allocsize = mapping->shmem_size;
 
 		/* Specify the segment file size using allocsize. */
-		ftruncate(mapping->segment_fd, allocsize);
+		if (ftruncate(mapping->segment_fd, allocsize) < 0)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					errmsg("could not set the size of file backing shared memory %p: %m",
+							mapping->shmem)));
 
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 						   PG_MMAP_FLAGS, mapping->segment_fd, 0);
@@ -946,7 +954,11 @@ AnonymousShmemResize(void)
 			continue;
 
 		/* Resize the backing anon file. */
-		ftruncate(m->segment_fd, new_size);
+		if (ftruncate(m->segment_fd, new_size) < 0)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					errmsg("could not resize file backing shared memory %p: %m",
+							m->shmem)));
 
 		/*
 		 * Fail hard if faced any issues. In theory we could try to handle this
-- 
2.34.1

0009-WIP-Support-shrinking-shared-buffers-20250228.patch (text/x-patch)
From 17a5e2416006e885d0d0a7bada02e56c6bb40486 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 27 Feb 2025 17:39:45 +0530
Subject: [PATCH 09/11] WIP: Support shrinking shared buffers

When shrinking the shared buffers pool, each buffer in the area being shrunk
needs to be flushed if it's dirty so as not to lose the changes to that buffer
after shrinking. Also, each such buffer needs to be removed from the buffer
mapping table so that backends do not access it after shrinking. This needs to
be done before we remap the shared memory segments related to buffer pools. If
a buffer being evicted is pinned, we raise a FATAL error. TODO: Ideally we
should just roll back the buffer pool resizing operation and try again, but we
need infrastructure to do so.

Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 79 ++++++++++++++++---
 src/backend/storage/buffer/buf_init.c         |  5 +-
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/storage/pg_shmem.h                |  3 +-
 4 files changed, 70 insertions(+), 18 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 2b144d45cf0..f084a0747ff 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -933,14 +933,6 @@ AnonymousShmemResize(void)
 	 */
 	pending_pm_shmem_resize = false;
 
-	/*
-	 * XXX: Currently only increasing of shared_buffers is supported. For
-	 * decreasing something similar has to be done, but buffer blocks with
-	 * data have to be drained first.
-	 */
-	if(NBuffersOld > NBuffers)
-		return false;
-
 	for(int i = 0; i < next_free_segment; i++)
 	{
 		/* Note that CalculateShmemSize indirectly depends on NBuffers */
@@ -998,8 +990,6 @@ AnonymousShmemResize(void)
 				 * reinitialize the new portion of buffer pool. Every other
 				 * process will wait on the shared barrier for that to finish,
 				 * since it's a part of the SHMEM_RESIZE_DONE phase.
-				 *
-				 * XXX: This is the right place for buffer eviction as well.
 				 */
 				ResizeBufferPool(NBuffersOld, true);
 
@@ -1022,6 +1012,52 @@ AnonymousShmemResize(void)
 	return true;
 }
 
+/*
+ * When shrinking the shared buffers pool, evict the buffers that will not be
+ * part of the shrunk buffer pool.
+ */
+static bool
+EvictExtraBuffers()
+{
+	bool result = true;
+
+	/*
+	 * If the buffer being evicted is locked, this function will need to wait,
+	 * so it should not be called from the postmaster, which cannot wait on a
+	 * lock.
+	 */
+	Assert(IsUnderPostmaster);
+
+	/*
+	 * Let only one backend perform eviction. We could split the work across
+	 * all the backends, but that doesn't seem necessary. The first backend to
+	 * acquire the lock sets its own PID as the evictor PID, so that other
+	 * backends do not perform eviction. Any backend which cannot take this
+	 * lock already knows that some backend is evicting the buffers, without
+	 * looking at evictor_pid. All the backends which do not perform eviction
+	 * still wait for this phase to finish, and thus release the lock before
+	 * the next phase begins. Thus the same LWLock can be used to select a
+	 * leader for each phase. TODO: This comment would be better placed
+	 * somewhere common to all phases.
+	 */
+	if (LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+	{
+		if (ShmemCtrl->evictor_pid == 0)
+		{
+			ShmemCtrl->evictor_pid = MyProcPid;
+
+			/*
+			 * TODO: Before evicting any buffer, we should check whether any of the
+			 * buffers are pinned. If we find that a buffer is pinned after evicting
+			 * most of them, that will impact performance since all those evicted
+			 * buffers might need to be read again.
+			 */
+			for (Buffer b = NBuffers + 1; b <= NBuffersOld; b++)
+			{
+				if (!EvictUnpinnedBuffer(b))
+				{
+					elog(WARNING, "could not remove buffer %u, it is pinned", b);
+					result = false;
+				}
+			}
+		}
+		LWLockRelease(ShmemResizeLock);
+	}
+
+	return result;
+}
+
 /*
  * We are asked to resize shared memory. Do the resize and make sure to wait on
  * the provided barrier until all simultaneously participating backends finish
@@ -1065,15 +1101,31 @@ ProcessBarrierShmemResize(Barrier *barrier)
 	/* First phase means the resize has begun, SHMEM_RESIZE_START */
 	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);
 
+	/*
+	 * Evict extra buffers when shrinking shared buffers. We need to do this
+	 * while the memory for extra buffers is still mapped i.e. before remapping
+	 * the shared memory segments to a smaller memory area.
+	 */
+	if (NBuffersOld > NBuffers)
+	{
+		/*
+		 * TODO: If the buffer eviction fails for any reason, we should
+		 * gracefully roll back the shared buffer resizing and try again. But
+		 * the infrastructure to do so is not available right now, hence just
+		 * raise FATAL so that the system restarts.
+		 */
+		if (!EvictExtraBuffers())
+			elog(FATAL, "buffer eviction failed");
+
+		BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_EVICT);
+	}
+
 	/* XXX: Split mremap and buffer reinitialization into two barrier phases */
 	AnonymousShmemResize();
 
 	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
 	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
 
-	/* Allow the last backend to reset the barrier */
+	/* Allow the last backend to reset the control area. */
 	if (BarrierArriveAndDetach(barrier))
-		ResetShmemBarrier();
+		ResetShmemCtrl();
 
 	return true;
 }
@@ -1518,7 +1570,8 @@ WaitOnShmemBarrier(int phase)
 }
 
 void
-ResetShmemBarrier(void)
+ResetShmemCtrl(void)
 {
 	BarrierInit(&ShmemCtrl->Barrier, 0);
+	ShmemCtrl->evictor_pid = 0;
 }
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 248fbf1633b..d8139a899bb 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -119,6 +119,7 @@ BufferManagerShmemInit(void)
 		/* Initialize with the currently known value */
 		pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
 		BarrierInit(&ShmemCtrl->Barrier, 0);
+		ShmemCtrl->evictor_pid = 0;
 	}
 
 	/* Align descriptors to a cacheline boundary. */
@@ -228,10 +229,6 @@ ResizeBufferPool(int NBuffersOld, bool initNew)
 	int			i;
 	elog(DEBUG1, "Resizing buffer pool from %d to %d", NBuffersOld, NBuffers);
 
-	/* XXX: Only increasing of shared_buffers is supported in this function */
-	if(NBuffersOld > NBuffers)
-		return;
-
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
 		ShmemInitStructInSegment("Buffer Descriptors",
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4203c987edc..a4a1e855c48 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -155,6 +155,7 @@ REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it c
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
 SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_EVICT	"Waiting for other backends to finish the buffer eviction phase."
 SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 3f103d708a5..3793f369313 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -68,6 +68,7 @@ extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 typedef struct
 {
 	pg_atomic_uint32 	NSharedBuffers;
+	pid_t				evictor_pid;
 	Barrier 			Barrier;
 } ShmemControl;
 
@@ -131,7 +132,7 @@ bool ProcessBarrierShmemResize(Barrier *barrier);
 void assign_shared_buffers(int newval, void *extra, bool *pending);
 void AdjustShmemSize(void);
 extern void WaitOnShmemBarrier(int phase);
-extern void ResetShmemBarrier(void);
+extern void ResetShmemCtrl(void);
 
 /*
  * To be able to dynamically resize largest parts of the data stored in shared
-- 
2.34.1
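
For readers not familiar with the barrier machinery used in
ProcessBarrierShmemResize(), here is a rough standalone analogy (illustration
only, using POSIX threads instead of PostgreSQL's Barrier API; compile with
-pthread): all participants arrive at each phase, one of them is elected to do
that phase's work (eviction, then remap), and nobody proceeds until everyone
has arrived again.

#include <pthread.h>
#include <stdio.h>

#define NPROCS 4

static pthread_barrier_t phase_barrier;

static void *
participant(void *arg)
{
	int			id = *(int *) arg;

	/* Phase: resize started, everyone acknowledges. */
	pthread_barrier_wait(&phase_barrier);

	/* Phase: one participant evicts buffers outside the new range. */
	if (pthread_barrier_wait(&phase_barrier) == PTHREAD_BARRIER_SERIAL_THREAD)
		printf("participant %d: evicting extra buffers\n", id);
	pthread_barrier_wait(&phase_barrier);	/* wait until eviction is done */

	/* Phase: one participant remaps the shared memory segments. */
	if (pthread_barrier_wait(&phase_barrier) == PTHREAD_BARRIER_SERIAL_THREAD)
		printf("participant %d: remapping shared memory\n", id);
	pthread_barrier_wait(&phase_barrier);	/* resize done */

	return NULL;
}

int
main(void)
{
	pthread_t	threads[NPROCS];
	int			ids[NPROCS];

	pthread_barrier_init(&phase_barrier, NULL, NPROCS);
	for (int i = 0; i < NPROCS; i++)
	{
		ids[i] = i;
		pthread_create(&threads[i], NULL, participant, &ids[i]);
	}
	for (int i = 0; i < NPROCS; i++)
		pthread_join(threads[i], NULL);
	pthread_barrier_destroy(&phase_barrier);
	return 0;
}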

0008-Add-TODOs-and-questions-about-previous-comm-20250228.patch (text/x-patch)
From 677a089ac8a8e0f9e700d60bbe59fc91f5b5276b Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 6 Jan 2025 14:40:51 +0530
Subject: [PATCH 08/11] Add TODOs and questions about previous commits

The commit just marks the places which need more work or where the
code raises some questions. This is not an exhaustive list of TODOs.
More TODOs may come up as I work with these patches further.

Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 |  8 +++++-
 src/backend/storage/buffer/buf_init.c         | 12 +++++++++
 src/backend/storage/buffer/bufmgr.c           |  5 ++++
 src/backend/storage/buffer/freelist.c         |  3 +++
 src/backend/storage/ipc/ipc.c                 |  4 +++
 src/backend/storage/ipc/ipci.c                | 10 +++++++
 src/backend/storage/ipc/shmem.c               | 26 ++++++++++++++++---
 src/backend/storage/lmgr/lwlock.c             |  5 ++++
 src/backend/tcop/postgres.c                   |  6 +++++
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/storage/buf_internals.h           |  5 ++++
 src/include/storage/pg_shmem.h                |  4 +++
 12 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 992ed849dc0..2b144d45cf0 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1011,7 +1011,13 @@ AnonymousShmemResize(void)
 
 			LWLockRelease(ShmemResizeLock);
 		}
-	}
+
+		/*
+		 * TODO: Shouldn't we call ResizeBufferPool() here as well? Otherwise
+		 * the backends that cannot conditionally lock the LWLock won't resize
+		 * the buffers.
+		 */
+		}
 
 	return true;
 }
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index b7de0ab6b0d..248fbf1633b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -211,6 +211,12 @@ BufferManagerShmemInit(void)
  * initNew flag indicates that the caller wants new buffers to be initialized.
  * No locks are taking in this function, it is the caller responsibility to
  * make sure only one backend can work with new buffers.
+ *
+ * TODO: Avoid code duplication with BufferManagerShmemInit() and also assess
+ * which functionality in the latter is required in this function. This
+ * function is similar to BufferManagerShmemInit(), but applies only to the
+ * buffers in the range between NBuffersOld and NBuffers.
+ *
  */
 void
 ResizeBufferPool(int NBuffersOld, bool initNew)
@@ -293,6 +299,12 @@ ResizeBufferPool(int NBuffersOld, bool initNew)
 	}
 
 	/* Correct last entry of linked list */
+	/*
+	 * TODO: I think this needs to be done only when expanding the buffers.
+	 * 
+	 * TODO: We should also fix the freelist to not point to a shrunk
+	 * buffer and to append the new buffers to the existing free list.
+	 */
 	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
 	/* Init other shared buffer-management stuff */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 80b0d0c5ded..b6bec73e6b7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2974,6 +2974,11 @@ BufferSync(int flags)
 		UnlockBufHdr(bufHdr, buf_state);
 
 		/* Check for barrier events in case NBuffers is large. */
+		/*
+		 * TODO: If we allow buffer resizing while this loop is being executed,
+		 * the loop will need to consider the new value of NBuffers. Hence avoid
+		 * resizing if this loop is being executed.
+		 */
 		if (ProcSignalBarrierPending)
 			ProcessProcSignalBarrier();
 	}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4919a92f2be..45a6e768332 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -484,6 +484,9 @@ StrategyInitialize(bool init)
 	 * a new entry before deleting the old.  In principle this could be
 	 * happening in each partition concurrently, so we could need as many as
 	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
+	 * 
+	 * TODO: If we are resizing, we need to preserve the earlier entries, don't
+	 * we?
 	 */
 	InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
 
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 9d526eb43fd..38bd6f41130 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -70,6 +70,10 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
+/*
+ * TODO: Why do we need to increase this by 20? I didn't notice any new calls to
+ * on_shmem_exit or on_proc_exit or before_shmem_exit.
+ */
 #define MAX_ON_EXITS 40
 
 struct ONEXIT
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a2c635f288e..029ad28fe0b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -204,6 +204,12 @@ AttachSharedMemoryStructs(void)
 /*
  * CreateSharedMemoryAndSemaphores
  *		Creates and initializes shared memory and semaphores.
+ *
+ * TODO: IMO this function should be rewritten to calculate the size of each
+ * shared memory slot or mapping. Instead of passing slot number to
+ * CalculateShmemSize, we should instead let each shared memory module use their
+ * own slot number and update the required sizes in the corresponding mapping.
+ * Then allocate shared memory in each of the mappings.
  */
 void
 CreateSharedMemoryAndSemaphores(void)
@@ -225,6 +231,10 @@ CreateSharedMemoryAndSemaphores(void)
 		 * Create the shmem segment.
 		 *
 		 * XXX: Do multiple shims are needed, one per segment?
+		 *
+		 * TODO: while each slot will return a different shim, only the last one
+		 * is passed to dsm_postmaster_startup(). Is that right? Shouldn't we
+		 * pass all of them or none.
 		 */
 		seghdr = PGSharedMemoryCreate(size, &shim);
 
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 226b38ba979..b569d125507 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -86,6 +86,11 @@ ShmemSegment Segments[ANON_MAPPINGS];
  * Primary index hashtable for shmem, for simplicity we use a single for all
  * shared memory segments. There can be performance consequences of that, and
  * an alternative option would be to have one index per shared memory segments.
+ *
+ * TODO: shouldn't this be part of the ShmemSegment structure? Some shared
+ * memory segments that hold only one structure do not need their pointers to be
+ * stored in the shared hash table; instead they could be part of the segment
+ * itself.
  */
 static HTAB *ShmemIndex = NULL;
 
@@ -493,9 +498,18 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already. Verify the structure's size:
-		 * - If it's the same, we've found the expected structure.
-		 * - If it's different, we're resizing the expected structure.
+		 * already. Verify the structure's size: - If it's the same, we've found
+		 * the expected structure.  - If it's different, we're resizing the
+		 * expected structure.
+		 *
+		 * TODO: This works because every structure that needs to be resized
+		 * resides in a shmem slot by itself. But it won't work if a slot
+		 * contains several structures that need to be resized, placed in
+		 * adjacent memory. Also we are not updating the Shmem stats like
+		 * freeoffset. I think we will keep all resizable structures in slots
+		 * by themselves, and not have a hash table in such slots, since
+		 * resizing the hash table itself might cause memory to be allocated
+		 * next to the resizable structure, making it difficult to resize it.
 		 */
 		if (result->size != size)
 			result->size = size;
@@ -587,6 +601,12 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	hash_seq_init(&hstat, ShmemIndex);
 
+	/*
+	 * TODO: For the sake of completeness we should rotate through all the slots
+	 * (after saving the slotwise ShmemIndex, if any). Do we also want to output
+	 * the shmem slot name? But that would expose the slotified structure of
+	 * shared memory.
+	 */
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
 	/* XXX: take all shared memory segments into account. */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 40aa4014b5f..8b6fe9b24f7 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -608,6 +608,11 @@ LWLockNewTrancheId(void)
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
 	/* We use the ShmemLock spinlock to protect LWLockCounter */
+	/*
+	 * TODO: We have retained the ShmemLock global variable; should we use it here
+	 * instead of the main segment's lock? If yes, we will need to init the
+	 * spinlock on the global one.
+	 */
 	SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 	result = (*LWLockCounter)++;
 	SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 04cdd0d24d8..e88448f2846 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4671,6 +4671,12 @@ PostgresMain(const char *dbname, const char *username)
 		/*
 		 * (6) check for any other interesting events that happened while we
 		 * slept.
+		 * TODO: When a backend is waiting for a command, it won't reload the
+		 * configuration and hence wouldn't notice a change in shared_buffers.
+		 * The change is only noticed after the command is received and control
+		 * comes here. We may need to improve this in case we want to resize
+		 * shared buffers or perform part of that operation in the assign_hook
+		 * implementation (e.g. AnonymousShmemResize()).
 		 */
 		if (ConfigReloadPending)
 		{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 947f13cb1fa..4203c987edc 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -348,6 +348,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+# TODO, not used anywhere, do we need it?
 ShmemResize	"Waiting to resize shared memory."
 
 #
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 4595f5a9676..416c405fe4e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -22,6 +22,11 @@
 #include "storage/condition_variable.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
+/*
+ * TODO: this header file doesn't use anything in pg_shmem.h, but the files which
+ * include this file may. We should include pg_shmem.h in those files rather than
+ * here.
+ */
 #include "storage/pg_shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b597df0d3a3..3f103d708a5 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -43,6 +43,10 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+/*
+ * TODO: should we define it in shmem.c where the previous global variables were
+ * declared? Do we need this structure outside shmem.c?
+ */
 typedef struct ShmemSegment
 {
 	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
-- 
2.34.1

pgbench-concurrent-resize-buffers.sh (application/x-shellscript)
#36Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#33)
Re: Changing shared_buffers without restart

On Tue, Feb 25, 2025 at 3:22 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal about how to
coordinate shared memory resizing between backends. The rest is more or less
the same, a feedback about coordination is appreciated. It's a lot to read, but
the main difference is about:

Thanks Dmitry for the summary.

1. Allowing to decouple a GUC value change from actually applying it, sort of a
"pending" change. The idea is to let a custom logic be triggered on an assign
hook, and then take responsibility for what happens later and how it's going to
be applied. This allows to use regular GUC infrastructure in cases where value
change requires some complicated processing. I was trying to make the change
not so invasive, plus it's missing GUC reporting yet.

2. Shared memory resizing patch became more complicated thanks to some
coordination between backends. The current implementation was chosen from few
more or less equal alternatives, which are evolving along following lines:

* There should be one "coordinator" process overseeing the change. Having
postmaster to fulfill this role like in this patch seems like a natural idea,
but it poses certain challenges since it doesn't have locking infrastructure.
Another option would be to elect a single backend to be a coordinator, which
will handle the postmaster as a special case. If there will ever be a
"coordinator" worker in Postgres, that would be useful here.

* The coordinator uses EmitProcSignalBarrier to reach out to all other backends
and trigger the resize process. Backends join a Barrier to synchronize and wait
until everyone is finished.

* There is some resizing state stored in shared memory, which is there to
handle backends that were for some reason late or didn't receive the signal.
What to store there is open for discussion.

* Since we want to make sure all processes share the same understanding of what
NBuffers value is, any failure is mostly a hard stop, since to rollback the
change coordination is needed as well and sounds a bit too complicated for now.

I think we should add a way to monitor the progress of resizing; at
least whether resizing is complete and whether the new GUC value is in
effect.

We've tested this change manually for now, although it might be useful to try
out injection points. The testing strategy, which has caught plenty of bugs,
was simply to run pgbench workload against a running instance and change
shared_buffers on the fly. Some more subtle cases were verified by manually
injecting delays to trigger expected scenarios.

I have shared a script with my changes but it's far from being full
testing. We will need to use injection points to test specific
scenarios.

--
Best Wishes,
Ashutosh Bapat

#37Ni Ku
jakkuniku@gmail.com
In reply to: Ashutosh Bapat (#36)
1 attachment(s)
Re: Changing shared_buffers without restart

Dmitry / Ashutosh,
Thanks for the patch set. I've been doing some testing with it, and in
particular I want to see whether this solution would work with a
hugepage-backed buffer pool.

I ran some simple tests (outside of PG) on Linux kernel v6.1, which has
this commit that added some hugepage support to mremap (
https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
).

From reading the kernel code and testing, for a hugepage-backed mapping it
seems mremap supports only shrinking but not growing. Further, for
shrinking, what I observed is that after mremap is called the hugepage
memory is not released back to the OS; rather, it's released when the fd is
closed (or when the memory is unmapped, for a mapping created with
MAP_ANONYMOUS). I'm not sure if this behavior is expected, but being able
to release memory back to the OS immediately after mremap would be
important for use cases such as supporting "serverless" PG instances on
the cloud.
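
A minimal standalone sketch of this behaviour (not the attached
hugepage_remap.c; it assumes 2MB huge pages are configured and a
Linux >= 6.1 kernel):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void)
{
	size_t old_size = 4 * HUGE_2MB;
	void *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED)
		return perror("mmap"), 1;
	memset(p, 0xab, old_size);	/* fault the huge pages in */

	/* Growing the hugetlb mapping: per the observation above this is
	 * expected to fail on current kernels. */
	void *grown = mremap(p, old_size, 8 * HUGE_2MB, MREMAP_MAYMOVE);
	if (grown != MAP_FAILED)
	{
		p = grown;
		old_size = 8 * HUGE_2MB;
	}
	printf("grow:   %s\n", grown == MAP_FAILED ? strerror(errno) : "ok");

	/* Shrinking succeeds, but watch HugePages_Free in /proc/meminfo:
	 * the pages are only returned to the OS on munmap/close. */
	void *shrunk = mremap(p, old_size, 2 * HUGE_2MB, 0);
	printf("shrink: %s\n", shrunk == MAP_FAILED ? strerror(errno) : "ok");
	return 0;
}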

I'm no expert in the Linux kernel, so I could be missing something. It'd be
great if you or somebody else could comment on these observations and whether
this mremap-based solution would work with a hugepage-backed buffer pool.

I also attached the test program in case someone can spot something I did
wrong.

Regards,

Jack Ng

Attachments:

hugepage_remap.c (application/octet-stream)
#38Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ni Ku (#37)
Re: Changing shared_buffers without restart

On Thu, Mar 20, 2025 at 04:55:47PM GMT, Ni Ku wrote:

I ran some simple tests (outside of PG) on linux kernel v6.1, which has
this commit that added some hugepage support to mremap (
https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
).

From reading the kernel code and testing, for a hugepage-backed mapping it
seems mremap supports only shrinking but not growing. Further, for
shrinking, what I observed is that after mremap is called the hugepage
memory
is not released back to the OS, rather it's released when the fd is closed
(or when the memory is unmapped for a mapping created with MAP_ANONYMOUS).
I'm not sure if this behavior is expected, but being able to release memory
back to the OS immediately after mremap would be important for use cases
such as supporting "serverless" PG instances on the cloud.

I'm no expert in the linux kernel so I could be missing something. It'd be
great if you or somebody can comment on these observations and whether this
mremap-based solution would work with hugepage bufferpool.

Hm, I think you're right. I didn't realize there is such a limitation, but
I just verified on the latest kernel build and hit the same condition on
increasing a hugetlb mapping that you've mentioned above. That's annoying of
course, but I've got another approach I was originally experimenting
with -- instead of mremap, do munmap and mmap with the new size and rely
on the anonymous fd to keep the memory content in between. I'm currently
reworking the mmap'ing part of the patch, let me check if this new approach
is something we could universally rely on.

#39Ni Ku
jakkuniku@gmail.com
In reply to: Dmitry Dolgov (#38)
Re: Changing shared_buffers without restart

Thanks for your insights and confirmation, Dmitry.
Right, I think the anonymous fd approach would work to keep the memory
contents intact in between munmap and mmap with the new size, so buffer pool
expansion would work.
But it seems shrinking would still be problematic, since that approach
requires the anonymous fd to remain open (to preserve the memory contents),
and so munmap would not release the memory back to the OS right away (it gets
released when the fd is closed). From testing, this is true for hugepage
memory at least.
Is there a way around this? Or maybe I misunderstood what you have in mind
;)

Regards,

Jack Ng

#40Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ni Ku (#39)
Re: Changing shared_buffers without restart

On Fri, Mar 21, 2025 at 04:48:30PM GMT, Ni Ku wrote:
Thanks for your insights and confirmation, Dmitry.
Right, I think the anonymous fd approach would work to keep the memory
contents intact in between munmap and mmap with the new size, so bufferpool
expansion would work.
But it seems shrinking would still be problematic, since that approach
requires the anonymous fd to remain open (for memory content protection),
and so munmap would not release the memory back to the OS right away (gets
released when the fd is closed). From testing this is true for hugepage
memory at least.
Is there a way around this? Or maybe I misunderstood what you have in mind
;)

The anonymous file will be truncated to its new, shrunken size before
mapping it a second time (I think this part is missing in your test
example). To my understanding, after a quick look at do_vmi_align_munmap,
this should be enough to make the memory reclaimable.
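
For illustration, a minimal sketch of this munmap/ftruncate/mmap sequence,
assuming an anonymous file created with memfd_create(MFD_HUGETLB); this is
not the actual patch code:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void)
{
	size_t old_size = 4 * HUGE_2MB;
	size_t new_size = 2 * HUGE_2MB;

	int fd = memfd_create("shm_seg", MFD_HUGETLB);
	if (fd < 0)
		return perror("memfd_create"), 1;
	if (ftruncate(fd, old_size) < 0)
		return perror("ftruncate"), 1;

	void *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return perror("mmap"), 1;

	/* Shrink: unmap, truncate the backing file (this is what actually
	 * releases the huge pages), then map the smaller size again. */
	munmap(p, old_size);
	if (ftruncate(fd, new_size) < 0)
		return perror("ftruncate (shrink)"), 1;
	p = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return perror("mmap (shrink)"), 1;

	puts("remapped at the smaller size; HugePages_Free should have gone up");
	return 0;
}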

#41Ni Ku
jakkuniku@gmail.com
In reply to: Dmitry Dolgov (#40)
Re: Changing shared_buffers without restart

You're right, Dmitry, truncating the anonymous file before mapping it again
does the trick! I see 'HugePages_Free' increase to the expected value right
after the ftruncate call when shrinking.
This alternative approach looks very promising. Thanks.

Regards,

Jack Ng

#42Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Ashutosh Bapat (#36)
Re: Changing shared_buffers without restart

On Fri, Feb 28, 2025 at 5:31 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

I think we should add a way to monitor the progress of resizing; at
least whether resizing is complete and whether the new GUC value is in
effect.

I further tested this approach by tracing the barrier synchronization
using the attached patch, which adds a bunch of elog() calls.
I ran a pgbench load and simultaneously
executed the following commands in a psql connection:

#alter system set shared_buffers to '200MB';
ALTER SYSTEM
#select pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)

#show shared_buffers;
shared_buffers
----------------
200MB
(1 row)

#select count(*) from pg_stat_activity;
count
-------
6
(1 row)

#select pg_backend_pid(); - the backend where all these commands were executed
pg_backend_pid
----------------
878405
(1 row)

I see the following in the postgresql error logs.

2025-03-12 11:04:53.812 IST [878167] LOG: received SIGHUP, reloading
configuration files
2025-03-12 11:04:53.813 IST [878405] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0

-- not all backends have reloaded configuration.

2025-03-12 11:04:53.813 IST [878173] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878173] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.813 IST [878173] LOG: reached barrier phase 1
2025-03-12 11:04:53.813 IST [878171] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878172] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878171] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878172] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878340] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE
pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.813 IST [878340] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE
pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.813 IST [878338] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.813 IST [878339] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.813 IST [878338] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.813 IST [878339] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878341] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878337] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878337] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE
pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878173] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE
pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE
pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878337] LOG: buffer resizing operation
finished at phase 4
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.814 IST [878168] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 1
2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878168] LOG: buffer resizing operation
finished at phase 3
2025-03-12 11:04:53.815 IST [878169] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:53.815 IST [878169] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 1
2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 2
2025-03-12 11:04:53.815 IST [878169] LOG: buffer resizing operation
finished at phase 3
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem
resizing from 16384 to -1, 0
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem
resizing from 16384 to 25600, 1
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: attached when barrier was at phase 0
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 1
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 2
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: buffer resizing operation
finished at phase 3
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;

To cut the story short: pid 173 (for brevity I am only mentioning the
last three digits of each PID) attached to the barrier first and
immediately reached phase 1. 171, 172, 340, 338, 339, 341 and 337 all
attached to the barrier in phase 1. All of these backends completed the
phases in a synchronous fashion. But 168, 169 and 405 were yet to attach
to the barrier since they hadn't reloaded their configurations yet. Each
of these backends then finished all phases independently of the others.

For your reference
#select pid, application_name, backend_type from pg_stat_activity
where pid in (878169, 878168);
pid | application_name | backend_type
--------+------------------+-------------------
878168 | | checkpointer
878169 | | background writer
(2 rows)

This is because BarrierArriveAndWait() only waits for the already
attached backends; it doesn't wait for backends which are yet to
attach. I think what we want is that *all* the backends execute all
the phases synchronously and wait for the others to finish. If we don't do
that, there's a possibility that some of them would see inconsistent
buffer states or, even worse, may not have the necessary memory mapped and
resized - thus causing segfaults. Am I correct?

I think what needs to be done is that every backend waits for the other
backends to attach themselves to the barrier before moving to the
first phase. One way I can think of is to use two signal barriers -
one to ensure that all the backends have attached themselves, and a
second for the actual resizing. But then the postmaster needs to wait for
all the processes to process the first signal barrier, and the postmaster
cannot wait on anything. Maybe there's a way to poll, but I didn't find
it. Does that mean that we have to make some other backend the coordinator?

--
Best Wishes,
Ashutosh Bapat

#43Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#42)
8 attachment(s)
Re: Changing shared_buffers without restart

On Mon, Apr 07, 2025 at 11:50:46AM GMT, Ashutosh Bapat wrote:
This is because the BarrierArriveAndWait() only waits for all the
attached backends. It doesn't wait for backends which are yet to
attach. I think what we want is *all* the backends should execute all
the phases synchronously and wait for others to finish. If we don't do
that, there's a possibility that some of them would see inconsistent
buffer states or even worse may not have necessary memory mapped and
resized - thus causing segfaults. Am I correct?

I think what needs to be done is that every backend should wait for other
backends to attach themselves to the barrier before moving to the
first phase. One way I can think of is we use two signal barriers -
one to ensure that all the backends have attached themselves and
second for the actual resizing. But then the postmaster needs to wait for
all the processes to process the first signal barrier. A postmaster can
not wait on anything. Maybe there's a way to poll, but I didn't find
it. Does that mean that we have to make some other backend a coordinator?

Yes, you're right, a plain dynamic Barrier does not ensure all available
processes will be synchronized. I was aware of the scenario you
describe; it's mentioned in the comments for the resize function. I was
under the impression that this should be enough, but after some more thinking
I'm not so sure anymore. Let me try to structure it as a list of
possible corner cases that we need to worry about:

* A new backend is spawned while we're busy resizing shared memory. Those
should wait until the resizing is complete and pick up the new size as well.

* An old backend receives a resize message, but exits before attempting to
resize. Those should be excluded from coordination.

* A backend is blocked and not responding before or after the
ProcSignalBarrier message was sent. I'm thinking about a failure
situation where one rogue backend is doing something without checking
for interrupts. We need to wait for those to become responsive, and
potentially abort the shared memory resize after some timeout.

* Backends join the barrier in disjoint groups with some time in
between, which is longer than what it takes to resize shared memory.
That means that relying only on the shared dynamic barrier is not
enough -- it will only synchronize the resize procedure within those
groups.

Out of those I think the third one poses some problems, e.g. if we're
shrinking the shared memory, but one backend is accessing the buffer pool
without checking for interrupts. In the v3 implementation this won't be
handled correctly; other backends will ignore such a rogue process.
Independently of that, we could reason about the logic much more easily if
it's guaranteed that all the processes about to resize shared memory will
wait for each other and start simultaneously.

It looks like to achieve that we need a slightly different combination of a
global Barrier and the ProcSignalBarrier mechanism. We can't use
ProcSignalBarrier as it is, because processes need to wait for each
other, and at the same time finish processing to bump the generation. We
also can't use a simple dynamic Barrier due to the possibility of disjoint
groups of processes. A static Barrier is not easier either, because we
would somehow need to know the exact number of processes, which might change
over time.

I think a relatively elegant solution is to extend the ProcSignalBarrier
mechanism to track not only pss_barrierGeneration, as a sign that
everything was processed, but also something like
pss_barrierReceivedGeneration, indicating that the message was received
everywhere but not yet processed. That would be enough to allow
processes to wait until the resize message has been received everywhere, and
then use a global Barrier to wait until all processes are finished. It's
somewhat similar to your proposal to use two signals, but has less
implementation overhead.
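
To make the idea concrete, here is a toy model of that handshake; the field
names follow the description above, while the struct and helper names are
hypothetical and not the patch's API:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PROCS 8

typedef struct
{
	/* bumped once the barrier message has been fully processed */
	atomic_uint_least64_t pss_barrierGeneration;
	/* bumped as soon as the message is received, before processing */
	atomic_uint_least64_t pss_barrierReceivedGeneration;
} ProcSignalSlotModel;

static ProcSignalSlotModel slots[MAX_PROCS];

/* Each backend calls this when it notices the resize message. */
void
mark_received(int proc, uint64_t gen)
{
	atomic_store(&slots[proc].pss_barrierReceivedGeneration, gen);
}

/*
 * Before starting the actual resize, everyone waits until the message has
 * been received everywhere; only then do they join the global Barrier and
 * walk through the resize phases together.
 */
bool
all_received(uint64_t gen, int nprocs)
{
	for (int i = 0; i < nprocs; i++)
		if (atomic_load(&slots[i].pss_barrierReceivedGeneration) < gen)
			return false;
	return true;
}

With the "received" generation available, the backends can distinguish
"everyone has seen the message" from "everyone has finished", which is what
the disjoint-group problem above requires.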

This would also allow different solutions regarding error handling. E.g.
we could wait unboundedly for all the processes we expect to resize,
assuming that the user will be able to intervene and fix an issue if
there is one. Or we could do a timed wait, and abort the resize after
some timeout if not all processes are ready yet. In the new v4 version
of the patch the first option is implemented.

On top of that there are the following changes:

* Shared memory address space is now reserved for future usage, making a
clash of shared memory segments (e.g. due to memory allocation)
impossible; a rough sketch of the general reservation technique is shown
after this list. There is a new GUC to control how much space to reserve,
called max_available_memory -- on the assumption that most of the time it
would make sense to set its value to the total amount of memory on the
machine. I'm open to suggestions regarding the name.

* There is one more patch to address hugepages remap. As mentioned in
this thread above, the Linux kernel has certain limitations when it comes
to mremap for segments allocated with huge pages. To work around that,
it's possible to replace mremap with a sequence of unmap and map again,
relying on the anon file behind the segment to keep the memory
content. I haven't found any downsides to this approach so far, but it
makes the anonymous file patch 0007 mandatory.
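
The reservation technique referenced in the first bullet can be sketched in
general terms as reserving a large PROT_NONE region up front and committing
only the part currently needed (an illustration only, not necessarily how
the patch implements it; 64-bit Linux assumed):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t reserved = 64UL * 1024 * 1024 * 1024;	/* e.g. max_available_memory */
	size_t initial  = 1UL * 1024 * 1024 * 1024;		/* current shared_buffers */

	/* Reserve address space only; PROT_NONE + MAP_NORESERVE commits nothing,
	 * so the reservation does not count towards used memory. */
	void *base = mmap(NULL, reserved, PROT_NONE,
					  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (base == MAP_FAILED)
		return perror("reserve"), 1;

	/* Commit the part we actually use; growing later just maps more of the
	 * reserved range starting at (char *) base + initial, so nothing else
	 * can ever be allocated in the way. */
	void *seg = mmap(base, initial, PROT_READ | PROT_WRITE,
					 MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (seg == MAP_FAILED)
		return perror("commit"), 1;

	printf("reserved %zu bytes at %p, committed %zu\n", reserved, base, initial);
	return 0;
}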

Attachments:

v4-0001-Allow-to-use-multiple-shared-memory-mappings.patch (text/plain; charset=us-ascii)
From 15b87a1cb89d3f31b656e27d07ead5aa935f1643 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 28 Feb 2025 19:54:47 +0100
Subject: [PATCH v4 1/8] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways in which the shared memory could be
organized.

Introduce the possibility to allocate multiple shared memory mappings, where
a single mapping is associated with a specified shared memory segment.
There is only a fixed number of available segments; currently only one
main shared memory segment is allocated. A new shared memory API is
introduced, extended with a segment as a new parameter. As a path of
least resistance, the original API is kept in place, utilizing the main
shared memory segment.
---
 src/backend/port/posix_sema.c     |   4 +-
 src/backend/port/sysv_sema.c      |   4 +-
 src/backend/port/sysv_shmem.c     | 138 ++++++++++++++++++++---------
 src/backend/port/win32_sema.c     |   2 +-
 src/backend/storage/ipc/ipc.c     |   4 +-
 src/backend/storage/ipc/ipci.c    |  63 +++++++------
 src/backend/storage/ipc/shmem.c   | 141 +++++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c |  13 ++-
 src/include/storage/ipc.h         |   2 +-
 src/include/storage/pg_sema.h     |   2 +-
 src/include/storage/pg_shmem.h    |  18 ++++
 src/include/storage/shmem.h       |  12 +++
 12 files changed, 278 insertions(+), 125 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..401e1113fa1 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 #endif
 
 	numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index f7c8638aec5..b6301463ac7 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -313,7 +313,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -334,7 +334,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 	numSharedSemas = 0;
 	maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..56af0231d24 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+	int shmem_segment;
+	Size shmem_size; 			/* Size of the mapping */
+	Pointer shmem; 				/* Pointer to the start of the mapped memory */
+	Pointer seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping segments */
+static int next_free_segment = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_segment)
+{
+	switch (shmem_segment)
+	{
+		case MAIN_SHMEM_SEGMENT:
+			return "main";
+		default:
+			return "unknown";
+	}
+}
+
+static void
+DebugMappings()
+{
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		AnonymousMapping m = Mappings[i];
+		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+			 MappingName(i), m.shmem, m.shmem_size);
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-	Size		allocsize = *size;
+	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
 				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		{
+			DebugMappings();
+			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
+				 MappingName(mapping->shmem_segment), allocsize);
+		}
 	}
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = *size;
+		allocsize = mapping->shmem_size;
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
 	if (ptr == MAP_FAILED)
 	{
 		errno = mmap_errno;
+		DebugMappings();
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
-	*size = allocsize;
-	return ptr;
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		AnonymousMapping m = Mappings[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	mapping->shmem_size = size;
+	mapping->shmem_segment = next_free_segment;
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping);
+
+		next_free_segment++;
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + next_free_segment;
 
 	for (;;)
 	{
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	mapping->seg_addr = memAddress;
+	mapping->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (mapping->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		AnonymousMapping m = Mappings[i];
+
+		if (m.seg_addr != NULL)
+		{
+			if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+			m.seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index 5854ad1f54d..e7365ff8060 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 567739b5be9..5b55bec8d9d 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -61,6 +61,8 @@ static void proc_exit_prepare(int code);
  * but provide some additional features we need --- in particular,
  * we want to register callbacks to invoke when we are disconnecting
  * from a broken shared-memory context but not exiting the postmaster.
+ * Maximum number of such exit callbacks depends on the number of shared
+ * segments.
  *
  * Callback functions can take zero, one, or two args: the first passed
  * arg is the integer exitcode, the second is the Datum supplied when
@@ -68,7 +70,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..8b38e985327 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -86,7 +86,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_segment)
 {
 	Size		size;
 	int			numSemas;
@@ -206,33 +206,38 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize(&numSemas);
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
-
-	/*
-	 * Create semaphores
-	 */
-	PGReserveSemaphores(numSemas);
-
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		/* Compute the size of the shared-memory block */
+		size = CalculateShmemSize(&numSemas, segment);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+		/*
+		 * Create the shmem segment.
+		 *
+		 * XXX: Do multiple shims are needed, one per segment?
+		 */
+		seghdr = PGSharedMemoryCreate(size, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSegment(seghdr, segment);
+
+		/*
+		 * Create semaphores
+		 */
+		PGReserveSemaphores(numSemas, segment);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSegment(segment);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -363,7 +368,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas);
+	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..389abc82519 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -75,19 +75,19 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
+								 int shmem_segment);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
+ShmemSegment Segments[ANON_MAPPINGS];
 
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/*
+ * Primary index hashtable for shmem, for simplicity we use a single for all
+ * shared memory segments. There can be performance consequences of that, and
+ * an alternative option would be to have one index per shared memory segments.
+ */
+static HTAB *ShmemIndex = NULL;
 
 
 /*
@@ -96,9 +96,17 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-	ShmemSegHdr = seghdr;
-	ShmemBase = seghdr;
-	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+	InitShmemAccessInSegment(seghdr, MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAccessInSegment(PGShmemHeader *seghdr, int shmem_segment)
+{
+	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	ShmemSegment *seg = &Segments[shmem_segment];
+	seg->ShmemSegHdr = shmhdr;
+	seg->ShmemBase = (void *) shmhdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -109,7 +117,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSegment(MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAllocationInSegment(int shmem_segment)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_segment].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -118,9 +132,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_segment].ShmemLock = (slock_t *) ShmemAllocUnlockedInSegment(sizeof(slock_t), shmem_segment);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_segment].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -145,11 +159,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocInSegment(Size size, int shmem_segment)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -179,6 +199,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSegment(size, allocated_size, MAIN_SHMEM_SEGMENT);
+}
+
+static void *
+ShmemAllocRawInSegment(Size size, Size *allocated_size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -198,22 +224,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= Segments[shmem_segment].ShmemSegHdr->totalsize)
 	{
-		newSpace = (char *) ShmemBase + newStart;
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
+		Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -231,6 +257,12 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+	return ShmemAllocUnlockedInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -241,19 +273,19 @@ ShmemAllocUnlocked(Size size)
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory (%zu bytes requested)",
 						size)));
-	ShmemSegHdr->freeoffset = newFree;
+	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (char *) ShmemBase + newStart;
+	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -268,7 +300,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSegment(addr, MAIN_SHMEM_SEGMENT);
+}
+
+bool
+ShmemAddrIsValidInSegment(const void *addr, int shmem_segment)
+{
+	return (addr >= Segments[shmem_segment].ShmemBase) && (addr < Segments[shmem_segment].ShmemEnd);
 }
 
 /*
@@ -329,6 +367,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSegment(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SEGMENT);
+}
+
+HTAB *
+ShmemInitHashInSegment(const char *name,		/* table string name for shmem index */
+			  long init_size,		/* initial table size */
+			  long max_size,		/* max size of the table */
+			  HASHCTL *infoP,		/* info about key and bucket size */
+			  int hash_flags,		/* info about infoP */
+			  int shmem_segment) 	/* in which segment to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -345,9 +395,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSegment(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_segment);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -380,6 +430,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSegment(name, size, foundPtr, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
+					  int shmem_segment)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -388,7 +445,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_segment].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -411,7 +468,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSegment(size, shmem_segment);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -428,8 +485,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in segment %d",
+						name, shmem_segment)));
 	}
 
 	if (*foundPtr)
@@ -454,7 +511,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -473,14 +530,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -537,10 +593,11 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
+	/* XXX: take all shared memory segments into account. */
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
 		values[2] = Int64GetDatum(ent->size);
 		values[3] = Int64GetDatum(ent->allocated_size);
 		named_allocated += ent->allocated_size;
@@ -552,15 +609,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	/* output shared memory allocated but not counted via the shmem index */
 	values[0] = CStringGetTextDatum("<anonymous>");
 	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 	/* output as-of-yet unused shared memory */
 	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 3df29658f18..8241c061507 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -80,6 +80,8 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
+#include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/procnumber.h"
@@ -618,10 +620,15 @@ LWLockNewTrancheId(void)
 	int		   *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
-	/* We use the ShmemLock spinlock to protect LWLockCounter */
-	SpinLockAcquire(ShmemLock);
+	/*
+	 * We use the ShmemLock spinlock to protect LWLockCounter.
+	 *
+	 * XXX: Looks like this is the only use of Segments outside of shmem.c,
+	 * it may be worth reshaping this part to hide the Segments structure.
+	 */
+	SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 	result = (*LWLockCounter)++;
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	return result;
 }
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 3baf418b3d1..6ebda479ced 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index fa6ca35a51f..8ae9637fcd0 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_segment);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..138078c29c5 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+} ShmemSegment;
+
+/* Number of available segments for anonymous memory mappings */
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -90,4 +105,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main segment, contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SEGMENT 0
+
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 904a336b851..5929f140236 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -29,15 +29,27 @@
 extern PGDLLIMPORT slock_t *ShmemLock;
 struct PGShmemHeader;			/* avoid including storage/pg_shmem.h here */
 extern void InitShmemAccess(struct PGShmemHeader *seghdr);
+extern void InitShmemAccessInSegment(struct PGShmemHeader *seghdr,
+									 int shmem_segment);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSegment(int shmem_segment);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSegment(Size size, int shmem_segment);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSegment(Size size, int shmem_segment);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSegment(const void *addr, int shmem_segment);
 extern void InitShmemIndex(void);
+extern void InitVariableShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
+									long max_size, HASHCTL *infoP,
+									int hash_flags, int shmem_segment);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSegment(const char *name, Size size,
+									  bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 

base-commit: 5e1915439085014140314979c4dd5e23bd677cac
-- 
2.45.1

v4-0002-Address-space-reservation-for-shared-memory.patch
From eae77d430e6e6cc3ec95b2cf613e4b3ae095e75e Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:21:33 +0200
Subject: [PATCH v4 2/8] Address space reservation for shared memory

Currently the kernel is responsible for choosing an address at which to place
each shared memory mapping, which is the lowest possible address that does not
clash with any other mapping. This is considered to be the most portable
approach, but one of the downsides is that there is no room left to resize the
allocated mappings. Here is how it looks for one mapping in /proc/$PID/maps;
/dev/zero represents the anonymous shared memory in question:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
    ...
    7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
    7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)

By specifying the mapping address directly it's possible to place the
mapping in a way that leaves room for resizing. The idea is:

* Reserve some address space by mmap'ing a large chunk of memory
  with PROT_NONE and MAP_NORESERVE. This prepares a playground for
  laying out shared memory without risking anything else interfering
  with it.

* Slice the reserved space up into sections, one for each shared
  memory segment.

* Allocate shared memory segments out of the corresponding slices,
  leaving unclaimed space between them. This is implemented by
  mmap'ing memory at a specified address within the reserved space
  with MAP_FIXED.

The result looks like this:

    012d9000-0133e000         [heap]
    7f443a800000-7f444196c000 /dev/zero (deleted)
    7f444196c000-7f470a800000                     # reserved space
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2

Things like address space randomization should not be a problem in this
context, since the randomization is applied to the mmap base, which is
chosen once per process.
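
To illustrate the idea outside of PostgreSQL, here is a minimal standalone
sketch (sizes, flags and error handling are simplified and illustrative only;
the real code goes through PGSharedMemoryCreate and CalculateShmemSize):

    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t  reserve_size = (size_t) 16 * 1024 * 1024 * 1024; /* address space only */
        size_t  segment_size = 128 * 1024 * 1024;                /* actually backed */
        char   *base, *seg;

        /* Reserve address space: no physical memory, not counted by cgroups. */
        base = mmap(NULL, reserve_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Carve an actual shared segment out of the reservation with MAP_FIXED. */
        seg = mmap(base, segment_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (seg == MAP_FAILED)
            return 1;

        /* The space after seg stays reserved, so the segment can grow later. */
        printf("reservation %p, segment %p\n", (void *) base, (void *) seg);
        return 0;
    }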

This approach also does not impact the actual memory usage as reported by
the kernel. Here is the output of /proc/$PID/status for the master
version with shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages
    // mapped in mm_struct. It corresponds to the mapped reserved space
    // and is the only number that grows with it.
    VmPeak:          2043192 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             22908 kB
    // Size of resident anonymous memory
    RssAnon:             768 kB
    // Size of resident file mappings
    RssFile:           10364 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          11776 kB

Here is the same for the patch when reserving 20GB of space:

    VmPeak:         21250648 kB
    VmRSS:             22948 kB
    RssAnon:             768 kB
    RssFile:           10404 kB
    RssShmem:          11776 kB

Cgroup v2 doesn't have any problems with that either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16.6 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    17637376 (~16.8 MB)

To control the amount of space reserved, a new GUC max_available_memory
is introduced. Ideally it should be based on the maximum available
memory, hence the name.
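
Since the GUC is measured in blocks (GUC_UNIT_BLOCKS), the address space
reserved at startup is simply the following (see CreateSharedMemoryAndSemaphores
in the patch); with the default BLCKSZ of 8 kB, the default of 131072 blocks
corresponds to 1 GB of reserved address space:

    Size reserve_size = (Size) MaxAvailableMemory * BLCKSZ;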
---
 src/backend/port/sysv_shmem.c       | 284 ++++++++++++++++++++++++----
 src/backend/port/win32_shmem.c      |   2 +-
 src/backend/storage/ipc/ipci.c      |   5 +-
 src/backend/utils/init/globals.c    |   1 +
 src/backend/utils/misc/guc_tables.c |  14 ++
 src/include/storage/pg_shmem.h      |   4 +-
 6 files changed, 271 insertions(+), 39 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 56af0231d24..a0f03ff868f 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -108,6 +108,66 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
+/*
+ * Anonymous mapping placing (/dev/zero (deleted) below) looks like this:
+ *
+ * 00400000-00490000         /path/bin/postgres
+ * ...
+ * 012d9000-0133e000         [heap]
+ * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ * 7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
+ * 7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)
+ * ...
+ *
+ * We would like to place multiple mappings in such a way that there will be
+ * enough space between them in the address space to be able to resize each up
+ * to a certain size, but without counting towards the total memory consumption.
+ *
+ * To achieve that we first reserve some shared memory address space by
+ * mmap'ing a segment of MaxAvailableMemory size with PROT_NONE and
+ * MAP_NORESERVE (these flags make sure this space will not be used by
+ * anything else, while not counting against memory limits). Having the reserved
+ * space, we allocate actual chunks of shared memory out of it as usual,
+ * updating a pointer to the currently available reserved space for the next
+ * allocation, keeping the gap between segments in mind.
+ *
+ * The result would look like this:
+ *
+ * 012d9000-0133e000         [heap]
+ * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f442e010000-7f443a800000                     # reserved empty space
+ * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f444196c000-7f470a800000                     # reserved empty space
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * [...]
+ *
+ * The reserved space pointer is calculated to slice up the total reserved
+ * space into fixed fractions of address space for each segment, as specified
+ * in the SHMEM_RESIZE_RATIO array.
+ */
+static double SHMEM_RESIZE_RATIO[1] = {
+	1.0, 									/* MAIN_SHMEM_SLOT */
+};
+
+/*
+ * Offset from the beginning of the reserved space, which indicates the
+ * currently available range. New shared memory segments have to be allocated
+ * at this offset relative to the start of the reserved space.
+ */
+static Size reserved_offset = 0;
+
+/*
+ * Flag telling that we have decided to use huge pages.
+ *
+ * XXX: It's possible to use GetConfigOption("huge_pages_status", false, false)
+ * instead, but it feels like an overkill.
+ */
+static bool huge_pages_on = false;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -626,39 +686,198 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
  *
  * This function will modify mapping size to the actual size of the allocation,
  * if it ends up allocating a segment that is larger than requested.
+ *
+ * Note that we do not switch from huge pages to regular pages in this
+ * function; this decision was already made in ReserveAnonymousMemory and we
+ * stick to it.
  */
 static void
-CreateAnonymousSegment(AnonymousMapping *mapping)
+CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 {
 	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
+	int			mmap_flags = PG_MMAP_FLAGS;
 
 #ifndef MAP_HUGETLB
-	/* PGSharedMemoryCreate should have dealt with this case */
-	Assert(huge_pages != HUGE_PAGES_ON);
+	/* ReserveAnonymousMemory should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
 #else
-	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	if (huge_pages_on)
 	{
-		/*
-		 * Round up the request size to a suitable large value.
-		 */
 		Size		hugepagesize;
-		int			mmap_flags;
 
+		/* Make sure nothing is messed up */
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+		/* Round up the request size to a suitable large value */
 		GetHugePageSize(&hugepagesize, &mmap_flags);
 
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
+		mmap_flags = PG_MMAP_FLAGS | mmap_flags;
+	}
+#endif
+
+	elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p",
+		 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
+
+	/*
+	 * Try to create the mapping at an address within the reserved range, which
+	 * will allow extending it later. Use reserved_offset to allocate the
+	 * segment, then update the currently available reserved range.
+	 *
+	 * If this fails, fall back to the regular mapping
+	 * creation and signal that shared buffers could not be resized without
+	 * a restart.
+	 */
+	ptr = mmap(base + reserved_offset, allocsize, PROT_READ | PROT_WRITE,
+			   mmap_flags | MAP_FIXED, -1, 0);
+	mmap_errno = errno;
+
+	if (ptr == MAP_FAILED)
+	{
+		DebugMappings();
+		elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p failed: %m, "
+					 "fallback to the non-resizable allocation",
+			 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
+						   PG_MMAP_FLAGS, -1, 0);
+		mmap_errno = errno;
+	}
+	else
+	{
+		Size total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
+
+		reserved_offset += total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
+	}
+
+	if (ptr == MAP_FAILED)
+	{
+		errno = mmap_errno;
+		DebugMappings();
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
+				 (mmap_errno == ENOMEM) ?
+				 errhint("This error usually means that PostgreSQL's request "
+						 "for a shared memory segment exceeded available memory, "
+						 "swap space, or huge pages. To reduce the request size "
+						 "(currently %zu bytes), reduce PostgreSQL's shared "
+						 "memory usage, perhaps by reducing \"shared_buffers\" or "
+						 "\"max_connections\".",
+						 allocsize) : 0));
+	}
+
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
+}
+
+/*
+ * ReserveAnonymousMemory
+ *
+ * Reserve shared memory address space, from which shared memory segments are
+ * going to be sliced out. The goal of this exercise is to support segment
+ * resizing, for which we need a reserved space free of potential clashes with
+ * other mmap'd areas that are not under our control. Reservation is done via
+ * mmap and will not allocate any memory until it is actually used, and
+ * MAP_NORESERVE keeps it from counting against kernel reservation
+ * limits (e.g. in cgroups or for huge pages). Do not be confused by
+ * MAP_NORESERVE -- we need to reserve some address space, but not the actual
+ * memory, and that is what this flag is about.
+ *
+ * Note that with MAP_NORESERVE a reservation with hugetlb will succeed even
+ * if there are actually not enough huge pages. Hence this function is
+ * responsible for deciding whether to use huge pages or not. To achieve that
+ * we need to probe first and try to allocate needed memory for all segments --
+ * if this succeeds, we unmap the probe segment and use hugetlb; if it fails,
+ * we proceed with the regular memory.
+ */
+void *
+ReserveAnonymousMemory(Size reserve_size)
+{
+	Size		allocsize = reserve_size;
+	void	   *ptr = MAP_FAILED;
+	int			mmap_errno = 0;
+
+	/* Complain if hugepages demanded but we can't possibly support them */
+#if !defined(MAP_HUGETLB)
+	if (huge_pages == HUGE_PAGES_ON)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("huge pages not supported on this platform")));
+#else
+	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	{
+		Size		hugepagesize, total_size = 0;
+		int			mmap_flags;
+
+		GetHugePageSize(&hugepagesize, &mmap_flags);
+
+		/*
+		 * Figure out how much memory is needed for all segments, keeping in
+		 * mind that for every segment this value will be rounded up to the
+		 * huge page size. The resulting value will be used to probe memory and
+		 * decide whether we will allocate huge pages or not.
+		 *
+		 * We could actually have a mix and match of segments with and without
+		 * huge pages. But in that case we need to have multiple reservation
+		 * spaces to use corresponding memory (hugetlb address space reserved
+		 * for hugetlb segments, regular memory for others), and it doesn't
+		 * seem worth the complexity for now.
+		 */
+		for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+		{
+			int	numSemas;
+			Size segment_size = CalculateShmemSize(&numSemas, segment);
+
+			if (segment_size % hugepagesize != 0)
+				segment_size += hugepagesize - (segment_size % hugepagesize);
+
+			total_size += segment_size;
+		}
+
+		/* Map total amount of memory to test its availability. */
+		elog(DEBUG1, "reserving space: probe mmap(%zu) with MAP_HUGETLB",
+					 total_size);
+		ptr = mmap(NULL, total_size, PROT_NONE,
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
 		{
-			DebugMappings();
-			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 MappingName(mapping->shmem_segment), allocsize);
+			/* No huge pages, we will go with the regular page size */
+			elog(DEBUG1, "reserving space: probe mmap(%zu) with MAP_HUGETLB "
+						 "failed, huge pages disabled: %m", total_size);
+		}
+		else
+		{
+			/*
+			 * All fine, unmap the temporary segment and proceed with reserving
+			 * using huge pages.
+			 */
+			if (munmap(ptr, total_size) < 0)
+				elog(LOG, "reservice space: munmap(%p, %zu) failed: %m",
+					 ptr, total_size);
+
+			/* Round up the requested size to a suitable large value. */
+			if (allocsize % hugepagesize != 0)
+				allocsize += hugepagesize - (allocsize % hugepagesize);
+
+			elog(DEBUG1, "reserving space: mmap(%zu) with MAP_HUGETLB",
+						 allocsize);
+			ptr = mmap(NULL, allocsize, PROT_NONE,
+					   PG_MMAP_FLAGS | MAP_ANONYMOUS | MAP_NORESERVE | mmap_flags,
+					   -1, 0);
+			mmap_errno = errno;
+
+			/* This should not happen, but handle errors anyway */
+			if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
+			{
+				elog(DEBUG1, "reserving space: mmap(%zu) with MAP_HUGETLB "
+							 "failed, huge pages disabled: %m", allocsize);
+			}
 		}
 	}
 #endif
@@ -666,10 +885,12 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 	/*
 	 * Report whether huge pages are in use.  This needs to be tracked before
 	 * the second mmap() call if attempting to use huge pages failed
-	 * previously.
+	 * previously. At this point ptr is either pointing to the probe segment,
+	 * if we couldn't mmap it, or the reservation space.
 	 */
 	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
 					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	huge_pages_on = ptr != MAP_FAILED;
 
 	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
 	{
@@ -677,10 +898,11 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = mapping->shmem_size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
-		mmap_errno = errno;
+		allocsize = reserve_size;
+
+		elog(DEBUG1, "reserving space: mmap(%zu)", allocsize);
+		ptr = mmap(NULL, allocsize, PROT_NONE,
+				   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
 	}
 
 	if (ptr == MAP_FAILED)
@@ -688,20 +910,18 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		errno = mmap_errno;
 		DebugMappings();
 		ereport(FATAL,
-				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
-						MappingName(mapping->shmem_segment)),
+				(errmsg("reserving space: could not map anonymous shared "
+						"memory: %m"),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
-						 "for a shared memory segment exceeded available memory, "
-						 "swap space, or huge pages. To reduce the request size "
-						 "(currently %zu bytes), reduce PostgreSQL's shared "
-						 "memory usage, perhaps by reducing \"shared_buffers\" or "
-						 "\"max_connections\".",
+						 "for a reserved shared memory address space exceeded "
+						 "available memory, swap space, or huge pages. To "
+						 "reduce the request reservation size (currently %zu "
+						 "bytes), reduce PostgreSQL's \"maximum_shared_buffers\".",
 						 allocsize) : 0));
 	}
 
-	mapping->shmem = ptr;
-	mapping->shmem_size = allocsize;
+	return ptr;
 }
 
 /*
@@ -740,7 +960,7 @@ AnonymousShmemDetach(int status, Datum arg)
  */
 PGShmemHeader *
 PGSharedMemoryCreate(Size size,
-					 PGShmemHeader **shim)
+					 PGShmemHeader **shim, Pointer base)
 {
 	IpcMemoryKey NextShmemSegID;
 	void	   *memAddress;
@@ -760,14 +980,6 @@ PGSharedMemoryCreate(Size size,
 				 errmsg("could not stat data directory \"%s\": %m",
 						DataDir)));
 
-	/* Complain if hugepages demanded but we can't possibly support them */
-#if !defined(MAP_HUGETLB)
-	if (huge_pages == HUGE_PAGES_ON)
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("huge pages not supported on this platform")));
-#endif
-
 	/* For now, we don't support huge pages in SysV memory */
 	if (huge_pages == HUGE_PAGES_ON && shared_memory_type != SHMEM_TYPE_MMAP)
 		ereport(ERROR,
@@ -782,7 +994,7 @@ PGSharedMemoryCreate(Size size,
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
 		/* On success, mapping data will be modified. */
-		CreateAnonymousSegment(mapping);
+		CreateAnonymousSegment(mapping, base);
 
 		next_free_segment++;
 
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 4dee856d6bd..ce719f1b412 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -205,7 +205,7 @@ EnableLockPagesPrivilege(int elevel)
  */
 PGShmemHeader *
 PGSharedMemoryCreate(Size size,
-					 PGShmemHeader **shim)
+					 PGShmemHeader **shim, Pointer base)
 {
 	void	   *memAddress;
 	PGShmemHeader *hdr;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8b38e985327..076888c0172 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -203,9 +203,12 @@ CreateSharedMemoryAndSemaphores(void)
 	PGShmemHeader *seghdr;
 	Size		size;
 	int			numSemas;
+	void 		*base;
 
 	Assert(!IsUnderPostmaster);
 
+	base = ReserveAnonymousMemory((Size) MaxAvailableMemory * BLCKSZ);
+
 	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
 	{
 		/* Compute the size of the shared-memory block */
@@ -217,7 +220,7 @@ CreateSharedMemoryAndSemaphores(void)
 		 *
 		 * XXX: Are multiple shims needed, one per segment?
 		 */
-		seghdr = PGSharedMemoryCreate(size, &shim);
+		seghdr = PGSharedMemoryCreate(size, &shim, base);
 
 		/*
 		 * Make sure that huge pages are never reported as "unknown" while the
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 2152aad97d9..1d42a5856c0 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -140,6 +140,7 @@ int			max_parallel_maintenance_workers = 2;
  * register background workers.
  */
 int			NBuffers = 16384;
+int			MaxAvailableMemory = 131072;
 int			MaxConnections = 100;
 int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..dede37f7905 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2364,6 +2364,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_available_memory", PGC_SIGHUP, RESOURCES_MEM,
+			gettext_noop("Sets the upper limit for the shared_buffers value."),
+			gettext_noop("Shared memory could be resized at runtime, this "
+						 "parameters sets the upper limit for it, beyond which "
+						 "resizing would not be supported. Normally this value "
+						 "would be the same as the total available memory."),
+			GUC_UNIT_BLOCKS
+		},
+		&MaxAvailableMemory,
+		131072, 16, INT_MAX / 2,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_buffer_usage_limit", PGC_USERSET, RESOURCES_MEM,
 			gettext_noop("Sets the buffer pool size for VACUUM, ANALYZE, and autovacuum."),
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 138078c29c5..4a83e255652 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -60,6 +60,7 @@ extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int MaxAvailableMemory;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
@@ -100,10 +101,11 @@ extern void PGSharedMemoryNoReAttach(void);
 #endif
 
 extern PGShmemHeader *PGSharedMemoryCreate(Size size,
-										   PGShmemHeader **shim);
+										   PGShmemHeader **shim, Pointer base);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
+void *ReserveAnonymousMemory(Size reserve_size);
 
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
-- 
2.45.1

v4-0003-Introduce-multiple-shmem-segments-for-shared-buff.patch
From dca1257476fc4c0718fec35b11ba0f4a4e57151b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 15 Mar 2025 16:38:59 +0100
Subject: [PATCH v4 3/8] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into the following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers,
meaning that if we want to change NBuffers, these structures have to be
resized correspondingly. Placing each of them in a separate shmem
segment makes that possible.

There are some assumptions made about each shmem segment's upper size
limit. The buffer blocks have the largest, while the rest claim less
extra room for resizing. Ideally those limits should be deduced from the
maximum allowed shared memory.
---
 src/backend/port/sysv_shmem.c          | 24 +++++++-
 src/backend/storage/buffer/buf_init.c  | 79 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  6 +-
 src/backend/storage/buffer/freelist.c  |  5 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 24 +++++++-
 7 files changed, 105 insertions(+), 37 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index a0f03ff868f..f46d9d5d9cd 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -147,10 +147,18 @@ static int next_free_segment = 0;
  *
  * The reserved space pointer is calculated to slice up the total reserved
  * space into fixed fractions of address space for each segment, as specified
- * in the SHMEM_RESIZE_RATIO array.
+ * in the SHMEM_RESIZE_RATIO array. E.g. we allow BUFFERS_SHMEM_SEGMENT to take
+ * up to 60% of the whole space when resizing, based on the fact that it most
+ * likely will be the main consumer of this memory. Those numbers are pulled
+ * out of thin air for now; it makes sense to evaluate them more precisely.
  */
-static double SHMEM_RESIZE_RATIO[1] = {
-	1.0, 									/* MAIN_SHMEM_SLOT */
+static double SHMEM_RESIZE_RATIO[6] = {
+	0.1,    /* MAIN_SHMEM_SEGMENT */
+	0.6,    /* BUFFERS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_IOCV_SHMEM_SEGMENT */
+	0.05,   /* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
+	0.05,   /* STRATEGY_SHMEM_SEGMENT */
 };
 
 /*
@@ -182,6 +190,16 @@ MappingName(int shmem_segment)
 	{
 		case MAIN_SHMEM_SEGMENT:
 			return "main";
+		case BUFFERS_SHMEM_SEGMENT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SEGMENT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SEGMENT:
+			return "strategy";
 		default:
 			return "unknown";
 	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..bd68b69ee98 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -62,7 +62,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). The size of the data structures
+ * initialized here depends on NBuffers, and to be able to change NBuffers
+ * without a restart we store each structure in a separate shared memory
+ * segment, which can be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -74,22 +77,22 @@ BufferManagerShmemInit(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSegment("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSegment("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -99,8 +102,9 @@ BufferManagerShmemInit(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -156,33 +160,54 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc., based on the
+ * shared memory segment. The main segment must not allocate anything
+ * related to buffers; every other segment will receive its part of the
+ * data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(int shmem_segment)
 {
 	Size		size = 0;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == MAIN_SHMEM_SEGMENT)
+		return size;
 
-	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
+	{
+		/* size of buffer descriptors */
+		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+		/* to allow aligning buffer descriptors */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of data pages, plus alignment padding */
+		size = add_size(size, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	}
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
-								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
+	{
+		/* size of stuff controlled by freelist.c */
+		size = add_size(size, StrategyShmemSize());
+	}
 
-	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
+	{
+		/* size of I/O condition variables */
+		size = add_size(size, mul_size(NBuffers,
+									   sizeof(ConditionVariableMinimallyPadded)));
+		/* to allow aligning the above */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
+
+	if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of checkpoint sort array in bufmgr.c */
+		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	}
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index a50955d5286..a9952b36eba 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -22,6 +22,7 @@
 #include "postgres.h"
 
 #include "storage/buf_internals.h"
+#include "storage/pg_shmem.h"
 
 /* entry for buffer lookup hashtable */
 typedef struct
@@ -59,10 +60,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION,
+								  STRATEGY_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 336715b6c63..81543cb5ced 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 
 #define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
@@ -491,9 +492,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SEGMENT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 076888c0172..9d00b80b4f8 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -113,7 +113,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize());
+	size = add_size(size, BufferManagerShmemSize(shmem_segment));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f2192ceb271..1977001e533 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -308,7 +308,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 4a83e255652..c5009a1cd73 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -107,7 +107,29 @@ extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 void *ReserveAnonymousMemory(Size reserve_size);
 
+/*
+ * To be able to dynamically resize the largest parts of the data stored in
+ * shared memory, we split it into multiple shared memory segments. Each
+ * segment contains only a certain part of the data, whose size depends on
+ * NBuffers.
+ */
+
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SEGMENT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SEGMENT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SEGMENT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SEGMENT 5
+
 #endif							/* PG_SHMEM_H */
-- 
2.45.1

v4-0004-Introduce-pending-flag-for-GUC-assign-hooks.patch
From 24704e57aea0ee94fbfb37ca3a4ea4fcf050a738 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH v4 4/8] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior that the new value is applied
immediately after the hook returns. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinated work
between backends to change, meaning we cannot apply the new value right away.

Add a new flag "pending" to allow an assign hook to indicate exactly
that. If the pending flag is set after the hook returns, the new value
will not be applied, and its handling becomes the responsibility of the
hook's implementation.

Note that this also requires changes in the way GUCs are reported, but
the patch does not cover that yet.
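
As an illustration, a hook using the new signature could look like the
following sketch (the hook name and the deferred handling are hypothetical;
the actual shared_buffers handling arrives in later patches of this series):

    /* Hypothetical assign hook that defers applying the new value. */
    void
    assign_resizable_guc(int newval, void *extra, bool *pending)
    {
        /*
         * Applying this value requires coordination between backends, so ask
         * guc.c not to install it right away; handling the stored request is
         * then up to the hook's own machinery.
         */
        *pending = true;
        /* remember newval and notify whatever drives the actual change */
    }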
---
 src/backend/access/transam/xlog.c    |  2 +-
 src/backend/commands/variable.c      |  6 +--
 src/backend/libpq/pqcomm.c           |  8 ++--
 src/backend/tcop/postgres.c          |  2 +-
 src/backend/utils/misc/guc.c         | 59 +++++++++++++++++++---------
 src/backend/utils/misc/stack_depth.c |  2 +-
 src/include/utils/guc.h              |  2 +-
 src/include/utils/guc_hooks.h        | 20 +++++-----
 8 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ec40c0b7c42..9aa426992a2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2321,7 +2321,7 @@ CalculateCheckpointSegments(void)
 }
 
 void
-assign_max_wal_size(int newval, void *extra)
+assign_max_wal_size(int newval, void *extra, bool *pending)
 {
 	max_wal_size_mb = newval;
 	CalculateCheckpointSegments();
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index a9f2a3a3062..e715a6f01c2 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1143,7 +1143,7 @@ check_cluster_name(char **newval, void **extra, GucSource source)
  * GUC assign_hook for maintenance_io_concurrency
  */
 void
-assign_maintenance_io_concurrency(int newval, void *extra)
+assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
 {
 	/*
 	 * Reconfigure recovery prefetching, because a setting it depends on
@@ -1161,13 +1161,13 @@ assign_maintenance_io_concurrency(int newval, void *extra)
  * they may be assigned in either order.
  */
 void
-assign_io_max_combine_limit(int newval, void *extra)
+assign_io_max_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_max_combine_limit = newval;
 	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
 }
 void
-assign_io_combine_limit(int newval, void *extra)
+assign_io_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit_guc = newval;
 	io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index e5171467de1..2a6a587ef76 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1952,7 +1952,7 @@ pq_settcpusertimeout(int timeout, Port *port)
  * GUC assign_hook for tcp_keepalives_idle
  */
 void
-assign_tcp_keepalives_idle(int newval, void *extra)
+assign_tcp_keepalives_idle(int newval, void *extra, bool *pending)
 {
 	/*
 	 * The kernel API provides no way to test a value without setting it; and
@@ -1985,7 +1985,7 @@ show_tcp_keepalives_idle(void)
  * GUC assign_hook for tcp_keepalives_interval
  */
 void
-assign_tcp_keepalives_interval(int newval, void *extra)
+assign_tcp_keepalives_interval(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivesinterval(newval, MyProcPort);
@@ -2008,7 +2008,7 @@ show_tcp_keepalives_interval(void)
  * GUC assign_hook for tcp_keepalives_count
  */
 void
-assign_tcp_keepalives_count(int newval, void *extra)
+assign_tcp_keepalives_count(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivescount(newval, MyProcPort);
@@ -2031,7 +2031,7 @@ show_tcp_keepalives_count(void)
  * GUC assign_hook for tcp_user_timeout
  */
 void
-assign_tcp_user_timeout(int newval, void *extra)
+assign_tcp_user_timeout(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_settcpusertimeout(newval, MyProcPort);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 6ae9f38f0c8..b1fba850f02 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3593,7 +3593,7 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 
 /* GUC assign hook for transaction_timeout */
 void
-assign_transaction_timeout(int newval, void *extra)
+assign_transaction_timeout(int newval, void *extra, bool *pending)
 {
 	if (IsTransactionState())
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 667df448732..bb681f5bc60 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1679,6 +1679,7 @@ InitializeOneGUCOption(struct config_generic *gconf)
 				struct config_int *conf = (struct config_int *) gconf;
 				int			newval = conf->boot_val;
 				void	   *extra = NULL;
+				bool 	   pending = false;
 
 				Assert(newval >= conf->min);
 				Assert(newval <= conf->max);
@@ -1687,9 +1688,13 @@ InitializeOneGUCOption(struct config_generic *gconf)
 					elog(FATAL, "failed to initialize %s to %d",
 						 conf->gen.name, newval);
 				if (conf->assign_hook)
-					conf->assign_hook(newval, extra);
-				*conf->variable = conf->reset_val = newval;
-				conf->gen.extra = conf->reset_extra = extra;
+					conf->assign_hook(newval, extra, &pending);
+
+				if (!pending)
+				{
+					*conf->variable = conf->reset_val = newval;
+					conf->gen.extra = conf->reset_extra = extra;
+				}
 				break;
 			}
 		case PGC_REAL:
@@ -2041,13 +2046,18 @@ ResetAllOptions(void)
 			case PGC_INT:
 				{
 					struct config_int *conf = (struct config_int *) gconf;
+					bool 			  pending = false;
 
 					if (conf->assign_hook)
 						conf->assign_hook(conf->reset_val,
-										  conf->reset_extra);
-					*conf->variable = conf->reset_val;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									conf->reset_extra);
+										  conf->reset_extra,
+										  &pending);
+					if (!pending)
+					{
+						*conf->variable = conf->reset_val;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										conf->reset_extra);
+					}
 					break;
 				}
 			case PGC_REAL:
@@ -2424,16 +2434,21 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
 							struct config_int *conf = (struct config_int *) gconf;
 							int			newval = newvalue.val.intval;
 							void	   *newextra = newvalue.extra;
+							bool 	    pending = false;
 
 							if (*conf->variable != newval ||
 								conf->gen.extra != newextra)
 							{
 								if (conf->assign_hook)
-									conf->assign_hook(newval, newextra);
-								*conf->variable = newval;
-								set_extra_field(&conf->gen, &conf->gen.extra,
-												newextra);
-								changed = true;
+									conf->assign_hook(newval, newextra, &pending);
+
+								if (!pending)
+								{
+									*conf->variable = newval;
+									set_extra_field(&conf->gen, &conf->gen.extra,
+													newextra);
+									changed = true;
+								}
 							}
 							break;
 						}
@@ -3850,18 +3865,24 @@ set_config_with_handle(const char *name, config_handle *handle,
 
 				if (changeVal)
 				{
+					bool pending = false;
+
 					/* Save old value to support transaction abort */
 					if (!makeDefault)
 						push_old_value(&conf->gen, action);
 
 					if (conf->assign_hook)
-						conf->assign_hook(newval, newextra);
-					*conf->variable = newval;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									newextra);
-					set_guc_source(&conf->gen, source);
-					conf->gen.scontext = context;
-					conf->gen.srole = srole;
+						conf->assign_hook(newval, newextra, &pending);
+
+					if (!pending)
+					{
+						*conf->variable = newval;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										newextra);
+						set_guc_source(&conf->gen, source);
+						conf->gen.scontext = context;
+						conf->gen.srole = srole;
+					}
 				}
 				if (makeDefault)
 				{
diff --git a/src/backend/utils/misc/stack_depth.c b/src/backend/utils/misc/stack_depth.c
index 8f7cf531fbc..ef59ae62008 100644
--- a/src/backend/utils/misc/stack_depth.c
+++ b/src/backend/utils/misc/stack_depth.c
@@ -156,7 +156,7 @@ check_max_stack_depth(int *newval, void **extra, GucSource source)
 
 /* GUC assign hook for max_stack_depth */
 void
-assign_max_stack_depth(int newval, void *extra)
+assign_max_stack_depth(int newval, void *extra, bool *pending)
 {
 	ssize_t		newval_bytes = newval * (ssize_t) 1024;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f619100467d..8802ad8a3cb 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -187,7 +187,7 @@ typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource sourc
 typedef bool (*GucEnumCheckHook) (int *newval, void **extra, GucSource source);
 
 typedef void (*GucBoolAssignHook) (bool newval, void *extra);
-typedef void (*GucIntAssignHook) (int newval, void *extra);
+typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);
 typedef void (*GucRealAssignHook) (double newval, void *extra);
 typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 typedef void (*GucEnumAssignHook) (int newval, void *extra);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 799fa7ace68..c8300cffa8e 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -81,14 +81,14 @@ extern bool check_log_stats(bool *newval, void **extra, GucSource source);
 extern bool check_log_timezone(char **newval, void **extra, GucSource source);
 extern void assign_log_timezone(const char *newval, void *extra);
 extern const char *show_log_timezone(void);
-extern void assign_maintenance_io_concurrency(int newval, void *extra);
-extern void assign_io_max_combine_limit(int newval, void *extra);
-extern void assign_io_combine_limit(int newval, void *extra);
+extern void assign_maintenance_io_concurrency(int newval, void *extra, bool *pending);
+extern void assign_io_max_combine_limit(int newval, void *extra, bool *pending);
+extern void assign_io_combine_limit(int newval, void *extra, bool *pending);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
 										 GucSource source);
-extern void assign_max_wal_size(int newval, void *extra);
+extern void assign_max_wal_size(int newval, void *extra, bool *pending);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
-extern void assign_max_stack_depth(int newval, void *extra);
+extern void assign_max_stack_depth(int newval, void *extra, bool *pending);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
@@ -143,13 +143,13 @@ extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 extern void assign_syslog_facility(int newval, void *extra);
 extern void assign_syslog_ident(const char *newval, void *extra);
-extern void assign_tcp_keepalives_count(int newval, void *extra);
+extern void assign_tcp_keepalives_count(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_count(void);
-extern void assign_tcp_keepalives_idle(int newval, void *extra);
+extern void assign_tcp_keepalives_idle(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_idle(void);
-extern void assign_tcp_keepalives_interval(int newval, void *extra);
+extern void assign_tcp_keepalives_interval(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_interval(void);
-extern void assign_tcp_user_timeout(int newval, void *extra);
+extern void assign_tcp_user_timeout(int newval, void *extra, bool *pending);
 extern const char *show_tcp_user_timeout(void);
 extern bool check_temp_buffers(int *newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra,
@@ -165,7 +165,7 @@ extern bool check_transaction_buffers(int *newval, void **extra, GucSource sourc
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
 extern bool check_transaction_read_only(bool *newval, void **extra, GucSource source);
-extern void assign_transaction_timeout(int newval, void *extra);
+extern void assign_transaction_timeout(int newval, void *extra, bool *pending);
 extern const char *show_unix_socket_permissions(void);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern bool check_wal_consistency_checking(char **newval, void **extra,
-- 
2.45.1

v4-0005-Introduce-pss_barrierReceivedGeneration.patch
From 619b10ec409185995a4a3ffd56972f1efa493c45 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH v4 5/8] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier allows making sure that the message sent
via EmitProcSignalBarrier was processed by all ProcSignal mechanism
participants.

Add pss_barrierReceivedGeneration alongside pss_barrierGeneration; it is
updated when a process has received the message but not yet processed it.
This makes it possible to support a new mode of waiting, in which ProcSignal
participants synchronize message processing. To do that, a participant can
wait via WaitForProcSignalBarrierReceived when processing a message,
effectively making sure that all processes start processing the
ProcSignalBarrier simultaneously.
---
 src/backend/storage/ipc/procsignal.c | 67 ++++++++++++++++++++++------
 src/include/storage/procsignal.h     |  1 +
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index b7c39a4c5f0..8e313ad9bf8 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -58,7 +58,10 @@
  * of it. For such use cases, we set a bit in pss_barrierCheckMask and then
  * increment the current "barrier generation"; when the new barrier generation
  * (or greater) appears in the pss_barrierGeneration flag of every process,
- * we know that the message has been received everywhere.
+ * we know that the message has been received and processed everywhere. In case
+ * we only need to know that the message was received everywhere (e.g. because
+ * receiving processes need to handle the message in a coordinated fashion),
+ * use pss_barrierReceivedGeneration in the same way.
  */
 typedef struct
 {
@@ -70,6 +73,7 @@ typedef struct
 
 	/* Barrier-related fields (not protected by pss_mutex) */
 	pg_atomic_uint64 pss_barrierGeneration;
+	pg_atomic_uint64 pss_barrierReceivedGeneration;
 	pg_atomic_uint32 pss_barrierCheckMask;
 	ConditionVariable pss_barrierCV;
 } ProcSignalSlot;
@@ -151,6 +155,8 @@ ProcSignalShmemInit(void)
 			slot->pss_cancel_key_len = 0;
 			MemSet(slot->pss_signalFlags, 0, sizeof(slot->pss_signalFlags));
 			pg_atomic_init_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+			pg_atomic_init_u64(&slot->pss_barrierReceivedGeneration,
+							   PG_UINT64_MAX);
 			pg_atomic_init_u32(&slot->pss_barrierCheckMask, 0);
 			ConditionVariableInit(&slot->pss_barrierCV);
 		}
@@ -198,6 +204,8 @@ ProcSignalInit(char *cancel_key, int cancel_key_len)
 	barrier_generation =
 		pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration,
+						barrier_generation);
 
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
@@ -262,6 +270,7 @@ CleanupProcSignalState(int status, Datum arg)
 	 * no barrier waits block on it.
 	 */
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration, PG_UINT64_MAX);
 
 	SpinLockRelease(&slot->pss_mutex);
 
@@ -415,12 +424,8 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
 	return generation;
 }
 
-/*
- * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
- * requested by a specific call to EmitProcSignalBarrier() have taken effect.
- */
-void
-WaitForProcSignalBarrier(uint64 generation)
+static void
+WaitForProcSignalBarrierInternal(uint64 generation, bool receivedOnly)
 {
 	Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));
 
@@ -435,12 +440,17 @@ WaitForProcSignalBarrier(uint64 generation)
 		uint64		oldval;
 
 		/*
-		 * It's important that we check only pss_barrierGeneration here and
-		 * not pss_barrierCheckMask. Bits in pss_barrierCheckMask get cleared
-		 * before the barrier is actually absorbed, but pss_barrierGeneration
+		 * It's important that we check only pss_barrierGeneration &
+		 * pss_barrierReceivedGeneration here and not pss_barrierCheckMask. Bits
+		 * in pss_barrierCheckMask get cleared before the barrier is actually
+		 * absorbed, but pss_barrierGeneration & pss_barrierReceivedGeneration
 		 * is updated only afterward.
 		 */
-		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+		if (receivedOnly)
+			oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+		else
+			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
 		while (oldval < generation)
 		{
 			if (ConditionVariableTimedSleep(&slot->pss_barrierCV,
@@ -449,7 +459,11 @@ WaitForProcSignalBarrier(uint64 generation)
 				ereport(LOG,
 						(errmsg("still waiting for backend with PID %d to accept ProcSignalBarrier",
 								(int) pg_atomic_read_u32(&slot->pss_pid))));
-			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
+			if (receivedOnly)
+				oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+			else
+				oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		}
 		ConditionVariableCancelSleep();
 	}
@@ -463,12 +477,33 @@ WaitForProcSignalBarrier(uint64 generation)
 	 * The caller is probably calling this function because it wants to read
 	 * the shared state or perform further writes to shared state once all
 	 * backends are known to have absorbed the barrier. However, the read of
-	 * pss_barrierGeneration was performed unlocked; insert a memory barrier
-	 * to separate it from whatever follows.
+	 * pss_barrierGeneration & pss_barrierReceivedGeneration was performed
+	 * unlocked; insert a memory barrier to separate it from whatever follows.
 	 */
 	pg_memory_barrier();
 }
 
+/*
+ * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
+ * requested by a specific call to EmitProcSignalBarrier() have taken effect.
+ */
+void
+WaitForProcSignalBarrier(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, false);
+}
+
+/*
+ * WaitForProcSignalBarrierReceived - wait until it is guaranteed that all
+ * backends have observed the message sent by a specific call to
+ * EmitProcSignalBarrier().
+ */
+void
+WaitForProcSignalBarrierReceived(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, true);
+}
+
 /*
  * Handle receipt of an interrupt indicating a global barrier event.
  *
@@ -522,6 +557,10 @@ ProcessProcSignalBarrier(void)
 	if (local_gen == shared_gen)
 		return;
 
+	/* The message is observed, record that */
+	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
+						shared_gen);
+
 	/*
 	 * Get and clear the flags that are set for this backend. Note that
 	 * pg_atomic_exchange_u32 is a full barrier, so we're guaranteed that the
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 016dfd9b3f6..defd8b66a19 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -79,6 +79,7 @@ extern void SendCancelRequest(int backendPID, char *cancel_key, int cancel_key_l
 
 extern uint64 EmitProcSignalBarrier(ProcSignalBarrierType type);
 extern void WaitForProcSignalBarrier(uint64 generation);
+extern void WaitForProcSignalBarrierReceived(uint64 generation);
 extern void ProcessProcSignalBarrier(void);
 
 extern void procsignal_sigusr1_handler(SIGNAL_ARGS);
-- 
2.45.1

v4-0006-Allow-to-resize-shared-memory-without-restart.patch
From 886a3ea87408e628bea08a9c77116343616ad032 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:47:16 +0200
Subject: [PATCH v4 6/8] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using the
space introduced in the previous commits, without requiring a PostgreSQL
restart. Essentially the implementation is based on two mechanisms: a
ProcSignalBarrier is used to make sure all processes start the resize
procedure simultaneously, and a global Barrier is used to coordinate after
that, making sure all finished processes wait for the others that are still
in progress.

The resize process looks like this:

* The GUC assign hook sets a flag to let the postmaster know that a resize
  was requested.

* The postmaster checks the flag in its event loop and starts the resize
  by emitting a ProcSignal barrier.

* All processes that participate in the ProcSignal mechanism begin to
  process the ProcSignal barrier. First a process waits until all processes
  have confirmed they received the message, so that everyone can start
  simultaneously.

* Every process recalculates the shared memory size based on the new
  NBuffers and extends it using mremap. One elected process signals the
  postmaster to do the same.

* When finished, every process waits on a global ShmemControl barrier,
  until all others are finished as well. This way we ensure three
  stages with clear boundaries: before the resize, when all processes
  use the old NBuffers; during the resize, when processes have a mix of old
  and new NBuffers and wait until it's done; after the resize, when all
  processes use the new NBuffers.

* After all processes are using the new value, one of them initializes the
  new shared structures (buffer blocks, descriptors, etc.) as needed and
  broadcasts the new value of NBuffers via ShmemControl in shared memory.
  Other backends wait for this operation to finish as well. Then the
  barrier is lifted and everything goes as usual (see the condensed sketch
  below).
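
Condensed, the per-backend part of this protocol (implemented below in
ProcessBarrierShmemResize in sysv_shmem.c) looks roughly like this:

    if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
        WaitForProcSignalBarrierReceived(
                pg_atomic_read_u64(&ShmemCtrl->Generation));

    /* phase SHMEM_RESIZE_START: one elected backend pings the postmaster */
    if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
        SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);

    AnonymousShmemResize();     /* mremap segments, maybe init new buffers */

    /* phase SHMEM_RESIZE_DONE: wait until everyone has finished */
    BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
    BarrierDetach(barrier);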

Since resizing takes time, we need to take into account that during that time:

- New backends can be spawned. They will check the status of the barrier
  early during bootstrap, and wait until everything is over before working
  with the new NBuffers value.

- Old backends can exit before attempting to resize. The synchronization
  used between backends relies on the ProcSignalBarrier and waits at the
  beginning for all participants to have received the message, gathering
  all existing backends.

- Some backends might be blocked and not responding, either before or
  after receiving the message. In the first case such a backend still
  has a ProcSignalSlot and should be waited for; in the second case the
  shared barrier makes sure we keep waiting for those backends. In
  either case there is an unbounded wait.

- Backends might join the barrier in disjoint groups, with some time in
  between. That means that relying only on the shared dynamic barrier is
  not enough -- it would only synchronize the resize procedure within
  those groups. That's why we first wait for all participants of the
  ProcSignal mechanism to have received the message.

Here is how it looks after raising shared_buffers from 128 MB to
512 MB and calling pg_reload_conf():

    -- 128 MB
    7f90cde00000-7f90d4fa6000  /dev/zero (deleted)
    7f90d4fa6000-7f914de00000
    7f914de00000-7f915cfa8000  /dev/zero (deleted)
    ^ buffers mapping, ~241 MB
    7f915cfa8000-7f944de00000
    7f944de00000-7f94550a8000  /dev/zero (deleted)
    7f94550a8000-7f94cde00000
    7f94cde00000-7f94d4fe8000  /dev/zero (deleted)
    7f94d4fe8000-7f954de00000
    7f954de00000-7f9554ff6000  /dev/zero (deleted)
    7f9554ff6000-7f958de00000
    7f958de00000-7f959508a000  /dev/zero (deleted)
    7f959508a000-7f95cde00000

    -- 512 MB
    7f90cde00000-7f90d5126000  /dev/zero (deleted)
    7f90d5126000-7f914de00000
    7f914de00000-7f9175128000  /dev/zero (deleted)
    ^ buffers mapping, ~627 MB
    7f9175128000-7f944de00000
    7f944de00000-7f9455528000  /dev/zero (deleted)
    7f9455528000-7f94cde00000
    7f94cde00000-7f94d5228000  /dev/zero (deleted)
    7f94d5228000-7f954de00000
    7f954de00000-7f9555266000  /dev/zero (deleted)
    7f9555266000-7f958de00000
    7f958de00000-7f95954aa000  /dev/zero (deleted)
    7f95954aa000-7f95cde00000

The implementation supports only increasing shared_buffers. Decreasing the
value would require a similar procedure, but the buffer blocks with data
have to be drained first, so that the actual data set fits into the new,
smaller space.

Experiments show that shared mappings have to be extended separately in
each process that uses them. Another rough edge is that a backend blocked
on ReadCommand will not apply the shared_buffers change until it receives
something.

Note that mremap is Linux specific, thus the implementation is not very
portable.
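
For reference, the per-segment grow step (as done in AnonymousShmemResize
below) boils down to the following Linux-specific sequence; base, old_size
and new_size are placeholder names here:

    /* give back a piece of the reserved gap that follows the mapping */
    if (munmap(base + old_size, new_size - old_size) == -1)
        elog(FATAL, "could not unmap reserved space: %m");

    /* extend the mapping in place into the freed space, keeping its address */
    if (mremap(base, old_size, new_size, 0) == MAP_FAILED)
        elog(FATAL, "could not resize mapping: %m");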

Authors: Dmitrii Dolgov, Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 413 ++++++++++++++++++
 src/backend/postmaster/postmaster.c           |  18 +
 src/backend/storage/buffer/buf_init.c         |  75 ++--
 src/backend/storage/ipc/ipci.c                |  18 +-
 src/backend/storage/ipc/procsignal.c          |  46 ++
 src/backend/storage/ipc/shmem.c               |  23 +-
 src/backend/tcop/postgres.c                   |  10 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/backend/utils/misc/guc_tables.c           |   4 +-
 src/include/miscadmin.h                       |   1 +
 src/include/storage/bufmgr.h                  |   2 +-
 src/include/storage/ipc.h                     |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pg_shmem.h                |  26 ++
 src/include/storage/pmsignal.h                |   1 +
 src/include/storage/procsignal.h              |   1 +
 src/tools/pgindent/typedefs.list              |   1 +
 17 files changed, 603 insertions(+), 43 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index f46d9d5d9cd..a3437973784 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,13 +30,19 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/pmsignal.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/pidfile.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -105,6 +111,13 @@ typedef struct AnonymousMapping
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/* Flag telling postmaster that resize is needed */
+volatile bool pending_pm_shmem_resize = false;
+
+/* Keeps track of the previous NBuffers value */
+static int NBuffersOld = -1;
+static int NBuffersPending = -1;
+
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
@@ -176,6 +189,49 @@ static Size reserved_offset = 0;
  */
 static bool huge_pages_on = false;
 
+/*
+ * Flag telling that we have prepared the memory layout to be resizable. If it
+ * is false after creation of all shared memory segments, we failed to set up
+ * the needed layout and fell back to the regular non-resizable approach.
+ */
+static bool shmem_resizable = false;
+
+/*
+ * Currently broadcasted value of NBuffers in shared memory.
+ *
+ * Most of the time this value is going to be equal to NBuffers. But if
+ * the postmaster is resizing shared memory and a new backend is created at
+ * the same time, there is a possibility for the new backend to inherit the
+ * old NBuffers value, but miss the resize signal if the ProcSignal
+ * infrastructure was not initialized yet. Consider this situation:
+ *
+ *     Postmaster ------> New Backend
+ *         |                   |
+ *         |                Launch
+ *         |                   |
+ *         |             Inherit NBuffers
+ *         |                   |
+ *     Resize NBuffers         |
+ *         |                   |
+ *     Emit Barrier            |
+ *         |            Init ProcSignal
+ *         |                   |
+ *     Finish resize           |
+ *         |                   |
+ *     New NBuffers       Old NBuffers
+ *
+ * In this case the backend is not yet ready to receive a signal from
+ * EmitProcSignalBarrier, and will be ignored. The same happens if ProcSignal
+ * is initialized even later, after the resizing has finished.
+ *
+ * To address the resulting inconsistency, the postmaster broadcasts the
+ * current NBuffers value via shared memory. Every new backend has to verify
+ * this value before it accesses the buffer pool: if it differs from its own
+ * value, this indicates a shared memory resize has happened and the backend
+ * has to first synchronize with the rest of the pack.
+ */
+ShmemControl *ShmemCtrl = NULL;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -769,6 +825,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	{
 		Size total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
 
+		shmem_resizable = true;
 		reserved_offset += total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
 	}
 
@@ -964,6 +1021,315 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * Resize all shared memory segments based on the current NBuffers value, which
+ * is applied from NBuffersPending. The actual segment resizing is done via
+ * mremap, which will fail if there is not sufficient space to expand the
+ * mapping. When finished, initialize new buffer blocks, if any, based on the
+ * new and old values.
+ *
+ * If remapping took place, as the last step this function also reinitializes
+ * the buffers and broadcasts the new value of NSharedBuffers. All of that
+ * needs to be done only by one backend, the first one that managed to grab
+ * the ShmemResizeLock.
+ */
+bool
+AnonymousShmemResize(void)
+{
+	int	numSemas;
+	bool reinit = false;
+	void *ptr = MAP_FAILED;
+	NBuffers = NBuffersPending;
+
+	elog(DEBUG1, "Resize shmem from %d to %d", NBuffersOld, NBuffers);
+
+	/*
+	 * XXX: Where to reset the flag is still an open question. E.g. do we
+	 * consider a no-op when NBuffers is equal to NBuffersOld a genuine resize
+	 * and reset the flag?
+	 */
+	pending_pm_shmem_resize = false;
+
+	/*
+	 * XXX: Currently only increasing of shared_buffers is supported. For
+	 * decreasing something similar has to be done, but buffer blocks with
+	 * data have to be drained first.
+	 */
+	if(NBuffersOld > NBuffers)
+		return false;
+
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		/* Note that CalculateShmemSize indirectly depends on NBuffers */
+		Size new_size = CalculateShmemSize(&numSemas, i);
+		AnonymousMapping *m = &Mappings[i];
+
+		if (m->shmem == NULL)
+			continue;
+
+		if (m->shmem_size == new_size)
+			continue;
+
+		/* Clean up some reserved space to resize into */
+		if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not unmap %zu from reserved shared memory %p: %m",
+							new_size - m->shmem_size, m->shmem)));
+
+		/* Claim the unused space */
+		elog(DEBUG1, "segment[%s]: remap from %zu to %zu at address %p",
+					 MappingName(m->shmem_segment), m->shmem_size,
+					 new_size, m->shmem);
+
+		ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
+		if (ptr == MAP_FAILED)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not resize shared memory segment %s [%p] to %d (%zu): %m",
+							MappingName(m->shmem_segment), m->shmem, NBuffers,
+							new_size)));
+
+		reinit = true;
+		m->shmem_size = new_size;
+	}
+
+	if (reinit)
+	{
+		if(IsUnderPostmaster &&
+			LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+		{
+			/*
+			 * If the new NBuffers was already broadcasted, the buffer pool was
+			 * already initialized before.
+			 *
+			 * Since we're not on a hot path, we use lwlocks and do not need to
+			 * involve memory barrier.
+			 */
+			if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+			{
+				/*
+				 * Allow the first backend that managed to get the lock to
+				 * reinitialize the new portion of buffer pool. Every other
+				 * process will wait on the shared barrier for that to finish,
+				 * since it's a part of the SHMEM_RESIZE_DONE phase.
+				 *
+				 * Note that it's enough when only one backend will do that,
+				 * even the ShmemInitStruct part. The reason is that resized
+				 * shared memory will maintain the same addresses, meaning that
+				 * all the pointers are still valid, and we only need to update
+				 * structures size in the ShmemIndex once -- any other backend
+				 * will pick up this shared structure from the index.
+				 *
+				 * XXX: This is the right place for buffer eviction as well.
+				 */
+				BufferManagerShmemInit(NBuffersOld);
+
+				/* If all fine, broadcast the new value */
+				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+			}
+
+			LWLockRelease(ShmemResizeLock);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * We are asked to resize shared memory. Wait for all ProcSignal participants
+ * to join the barrier, then do the resize and wait on the barrier until all
+ * participants finish resizing as well -- otherwise we risk inconsistencies
+ * between backends.
+ *
+ * XXX: If a backend is blocked on ReadCommand in PostgresMain, it will not
+ * proceed with AnonymousShmemResize after receiving SIGHUP until something
+ * is sent.
+ */
+bool
+ProcessBarrierShmemResize(Barrier *barrier)
+{
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+
+	/* Wait until we have seen the new NBuffers value */
+	if (!pending_pm_shmem_resize)
+		return false;
+
+	/*
+	 * The first thing to do after attaching to the barrier is to wait for
+	 * others. We can't simply use BarrierArriveAndWait, because backends might
+	 * arrive here in disjoint groups, e.g. first two backends, a pause, then
+	 * another two backends. If the resize is quick enough, that can lead to a
+	 * situation where the first group is already finished before the second
+	 * has appeared, and the barrier will only synchronize within those groups.
+	 */
+	if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
+		WaitForProcSignalBarrierReceived(
+				pg_atomic_read_u64(&ShmemCtrl->Generation));
+
+	/*
+	 * Now start the procedure, and elect one backend to ping postmaster to do
+	 * the same.
+	 *
+	 * XXX: If we need to be able to abort resizing, this has to be done later,
+	 * after the SHMEM_RESIZE_DONE.
+	 */
+	if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+	{
+		Assert(IsUnderPostmaster);
+		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
+	}
+
+	AnonymousShmemResize();
+
+	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
+	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
+
+	BarrierDetach(barrier);
+	return true;
+}
+
+/*
+ * GUC assign hook for shared_buffers. It's recommended for an assign hook to
+ * be as minimal as possible, thus we just request shared memory resize and
+ * remember the previous value.
+ */
+void
+assign_shared_buffers(int newval, void *extra, bool *pending)
+{
+	elog(DEBUG1, "Received SIGHUP for shmem resizing");
+
+	/* Request shared memory resize only when it was initialized */
+	if (next_free_segment != 0)
+	{
+		elog(DEBUG1, "Set pending signal");
+		pending_pm_shmem_resize = true;
+		*pending = true;
+		NBuffersPending = newval;
+	}
+
+	NBuffersOld = NBuffers;
+}
+
+/*
+ * Test if we have somehow missed a shmem resize signal and the NBuffers value
+ * differs from NSharedBuffers. If so, catch up and do the resize.
+ */
+void
+AdjustShmemSize(void)
+{
+	uint32 NSharedBuffers = pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);
+
+	if (NSharedBuffers != NBuffers)
+	{
+		/*
+		 * If the broadcasted shared_buffers is different from the one we see,
+		 * it could be that the backend has missed a resize signal. To avoid
+		 * any inconsistency, adjust the shared mappings, before having a
+		 * chance to access the buffer pool.
+		 */
+		ereport(LOG,
+				(errmsg("shared_buffers has been changed from %d to %d, "
+						"resizing shared memory",
+						NBuffers, NSharedBuffers)));
+		NBuffers = NSharedBuffers;
+		AnonymousShmemResize();
+	}
+}
+
+/*
+ * Start the resize procedure, making sure all existing processes will have a
+ * consistent view of the shared memory size. Must be called only in postmaster.
+ */
+void
+CoordinateShmemResize(void)
+{
+	elog(DEBUG1, "Coordinating shmem resize from %d to %d",
+		 NBuffersOld, NBuffers);
+	Assert(!IsUnderPostmaster);
+
+	/*
+	 * We use a dynamic barrier to help deal with backends that were spawned
+	 * during the resize.
+	 */
+	BarrierInit(&ShmemCtrl->Barrier, 0);
+
+	/*
+	 * If the value did not change, or shared memory segments are not
+	 * initialized yet, skip the resize.
+	 */
+	if (NBuffersPending == NBuffersOld || next_free_segment == 0)
+	{
+		elog(DEBUG1, "Skip resizing, new %d, old %d, free segment %d",
+			 NBuffers, NBuffersOld, next_free_segment);
+		return;
+	}
+
+	/*
+	 * Shared memory resize requires some coordination done by postmaster,
+	 * and consists of three phases:
+	 *
+	 * - Before the resize all existing backends have the same old NBuffers.
+	 * - When the resize is in progress, backends are expected to have a
+	 *   mixture of old and new values. They're not allowed to touch the
+	 *   buffer pool during this time frame.
+	 * - After the resize has finished, all existing backends that can access
+	 *   the buffer pool are expected to have the same new value of NBuffers.
+	 *
+	 * Those phases are ensured by joining the shared barrier associated with
+	 * the procedure. Since resizing takes time, we need to take into account
+	 * that during that time:
+	 *
+	 * - New backends can be spawned. They will check the status of the barrier
+	 *   early during bootstrap, and wait until everything is over before
+	 *   working with the new NBuffers value.
+	 *
+	 * - Old backends can exit before attempting to resize. The synchronization
+	 *   used between backends relies on the ProcSignalBarrier and waits at the
+	 *   beginning for all participants to have received the message, gathering
+	 *   all existing backends.
+	 *
+	 * - Some backends might be blocked and not responding, either before or
+	 *   after receiving the message. In the first case such a backend still
+	 *   has a ProcSignalSlot and should be waited for; in the second case the
+	 *   shared barrier makes sure we keep waiting for those backends. In
+	 *   either case there is an unbounded wait.
+	 *
+	 * - Backends might join the barrier in disjoint groups, with some time in
+	 *   between. That means that relying only on the shared dynamic barrier is
+	 *   not enough -- it would only synchronize the resize procedure within
+	 *   those groups. That's why we first wait for all participants of the
+	 *   ProcSignal mechanism to have received the message.
+	 */
+	elog(DEBUG1, "Emit a barrier for shmem resizing");
+	pg_atomic_init_u64(&ShmemCtrl->Generation,
+					   EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE));
+
+	/* To order everything after setting Generation value */
+	pg_memory_barrier();
+
+	/*
+	 * After that the postmaster waits for PMSIGNAL_SHMEM_RESIZE as a sign that
+	 * the rest of the pack has started the procedure and it can resize shared
+	 * memory as well.
+	 *
+	 * Normally we would call WaitForProcSignalBarrier here to wait until every
+	 * backend has reported on the ProcSignalBarrier. But for shared memory
+	 * resize we don't need this, as every participating backend will
+	 * synchronize on the ProcSignal barrier. In fact even if we would like to
+	 * wait here, it wouldn't be possible -- we're in the postmaster, without
+	 * any waiting infrastructure available.
+	 *
+	 * If at some point it turns out that waiting is essential, we would need
+	 * to consider some alternatives. E.g. it could be a designated
+	 * coordination process, which is not the postmaster. Another option would be
+	 * to introduce a CoordinateShmemResize lock and allow only one process to
+	 * take it (this probably would have to be something different than
+	 * LWLocks, since they block interrupts, and coordination relies on them).
+	 */
+}
+
 /*
  * PGSharedMemoryCreate
  *
@@ -1271,3 +1637,50 @@ PGSharedMemoryDetach(void)
 		}
 	}
 }
+
+void
+WaitOnShmemBarrier(void)
+{
+	Barrier *barrier = &ShmemCtrl->Barrier;
+
+	/* Nothing to do if resizing is not started */
+	if (BarrierPhase(barrier) < SHMEM_RESIZE_START)
+		return;
+
+	BarrierAttach(barrier);
+
+	/* Otherwise wait through all available phases */
+	while (BarrierPhase(barrier) < SHMEM_RESIZE_DONE)
+	{
+		ereport(LOG, (errmsg("ProcSignal barrier is in phase %d, waiting",
+							 BarrierPhase(barrier))));
+
+		BarrierArriveAndWait(barrier, 0);
+	}
+
+	BarrierDetach(barrier);
+}
+
+void
+ShmemControlInit(void)
+{
+	bool foundShmemCtrl;
+
+	ShmemCtrl = (ShmemControl *)
+	ShmemInitStruct("Shmem Control", sizeof(ShmemControl),
+									 &foundShmemCtrl);
+
+	if (!foundShmemCtrl)
+	{
+		/*
+		 * The barrier is missing here, it will be initialized right before
+		 * The barrier is not initialized here; it will be initialized right
+		 * before starting the resize, as a convenient way to reset it.
+
+		/* Initialize with the currently known value */
+		pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+
+		/* shmem_resizable should be initialized by now */
+		ShmemCtrl->Resizable = shmem_resizable;
+	}
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3fe45de5da0..196f233fe0e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -425,6 +425,7 @@ static void process_pm_pmsignal(void);
 static void process_pm_child_exit(void);
 static void process_pm_reload_request(void);
 static void process_pm_shutdown_request(void);
+static void process_pm_shmem_resize(void);
 static void dummy_handler(SIGNAL_ARGS);
 static void CleanupBackend(PMChild *bp, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
@@ -1693,6 +1694,9 @@ ServerLoop(void)
 			if (pending_pm_pmsignal)
 				process_pm_pmsignal();
 
+			if (pending_pm_shmem_resize)
+				process_pm_shmem_resize();
+
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
 				ClientSocket s;
@@ -2038,6 +2042,17 @@ process_pm_reload_request(void)
 	}
 }
 
+static void
+process_pm_shmem_resize(void)
+{
+	/*
+	 * Failure to resize is considered to be fatal and will not be
+	 * retried, which means we can disable pending flag right here.
+	 */
+	pending_pm_shmem_resize = false;
+	CoordinateShmemResize();
+}
+
 /*
  * pg_ctl uses SIGTERM, SIGINT and SIGQUIT to request different types of
  * shutdown.
@@ -3851,6 +3866,9 @@ process_pm_pmsignal(void)
 		request_state_update = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_SHMEM_RESIZE))
+		AnonymousShmemResize();
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
 	 */
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index bd68b69ee98..ac844b114bd 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -17,6 +17,7 @@
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -24,7 +25,6 @@ ConditionVariableMinimallyPadded *BufferIOCVArray;
 WritebackContext BackendWritebackContext;
 CkptSortItem *CkptBufferIds;
 
-
 /*
  * Data Structures:
  *		buffers live in a freelist and a lookup data structure.
@@ -62,18 +62,28 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend). Size of data structures initialized
- * here depends on NBuffers, and to be able to change NBuffers without a
- * restart we store each structure into a separate shared memory segment, which
- * could be resized on demand.
+ * postmaster, or in a standalone backend) or during shared-memory resize. Size
+ * of data structures initialized here depends on NBuffers, and to be able to
+ * change NBuffers without a restart we store each structure into a separate
+ * shared memory segment, which could be resized on demand.
+ *
+ * FirstBufferToInit tells where to start initializing buffers. For the
+ * initial setup it will always be zero, but when resizing shared memory it
+ * indicates the number of already initialized buffers.
+ *
+ * No locks are taken in this function; it is the caller's responsibility to
+ * make sure only one backend can work with the new buffers.
  */
 void
-BufferManagerShmemInit(void)
+BufferManagerShmemInit(int FirstBufferToInit)
 {
 	bool		foundBufs,
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
+	int			i;
+	elog(DEBUG1, "BufferManagerShmemInit from %d to %d",
+				 FirstBufferToInit, NBuffers);
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
@@ -110,43 +120,44 @@ BufferManagerShmemInit(void)
 	{
 		/* should find all of these, or none of them */
 		Assert(foundDescs && foundBufs && foundIOCV && foundBufCkpt);
-		/* note: this path is only taken in EXEC_BACKEND case */
-	}
-	else
-	{
-		int			i;
-
 		/*
-		 * Initialize all the buffer headers.
+		 * note: this path is only taken in EXEC_BACKEND case when initializing
+		 * shared memory, or in all cases when resizing shared memory.
 		 */
-		for (i = 0; i < NBuffers; i++)
-		{
-			BufferDesc *buf = GetBufferDescriptor(i);
+	}
 
-			ClearBufferTag(&buf->tag);
+#ifndef EXEC_BACKEND
+	/*
+	 * Initialize all the buffer headers.
+	 */
+	for (i = FirstBufferToInit; i < NBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
 
-			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+		ClearBufferTag(&buf->tag);
 
-			buf->buf_id = i;
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
 
-			pgaio_wref_clear(&buf->io_wref);
+		buf->buf_id = i;
 
-			/*
-			 * Initially link all the buffers together as unused. Subsequent
-			 * management of this list is done by freelist.c.
-			 */
-			buf->freeNext = i + 1;
+		pgaio_wref_clear(&buf->io_wref);
 
-			LWLockInitialize(BufferDescriptorGetContentLock(buf),
-							 LWTRANCHE_BUFFER_CONTENT);
+		/*
+		 * Initially link all the buffers together as unused. Subsequent
+		 * management of this list is done by freelist.c.
+		 */
+		buf->freeNext = i + 1;
 
-			ConditionVariableInit(BufferDescriptorGetIOCV(buf));
-		}
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
 
-		/* Correct last entry of linked list */
-		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
 	}
+#endif
+
+	/* Correct last entry of linked list */
+	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 9d00b80b4f8..abeb91e24fd 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -84,6 +84,9 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ *
+ * XXX: Calculations for non-main shared memory segments are incorrect; they
+ * include more than what is needed for buffers only.
  */
 Size
 CalculateShmemSize(int *num_semaphores, int shmem_segment)
@@ -151,6 +154,14 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how slots are split, or was there a hidden
+	 * inconsistency in the shmem calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
@@ -298,7 +309,7 @@ CreateOrAttachShmemStructs(void)
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
-	BufferManagerShmemInit();
+	BufferManagerShmemInit(0);
 
 	/*
 	 * Set up lock manager
@@ -310,6 +321,11 @@ CreateOrAttachShmemStructs(void)
 	 */
 	PredicateLockShmemInit();
 
+	/*
+	 * Set up shared memory resize manager
+	 */
+	ShmemControlInit();
+
 	/*
 	 * Set up process table
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 8e313ad9bf8..35c42f260a8 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/pg_shmem.h"
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
@@ -112,6 +113,10 @@ static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
 
+#ifdef DEBUG_SHMEM_RESIZE
+bool delay_proc_signal_init = false;
+#endif
+
 /*
  * ProcSignalShmemSize
  *		Compute space needed for ProcSignal's shared memory
@@ -175,6 +180,43 @@ ProcSignalInit(char *cancel_key, int cancel_key_len)
 	uint32		old_pss_pid;
 
 	Assert(cancel_key_len >= 0 && cancel_key_len <= MAX_CANCEL_KEY_LENGTH);
+
+#ifdef DEBUG_SHMEM_RESIZE
+	/*
+	 * Introduced for debugging purposes. You can change the variable at
+	 * runtime using gdb, then start new backends with delayed ProcSignal
+	 * initialization. A simple pg_usleep won't work here due to the SIGHUP
+	 * interrupt needed for testing. Taken from pg_sleep.
+	 */
+	if (delay_proc_signal_init)
+	{
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+		float8		endtime = GetNowFloat() + 5;
+
+		for (;;)
+		{
+			float8		delay;
+			long		delay_ms;
+
+			CHECK_FOR_INTERRUPTS();
+
+			delay = endtime - GetNowFloat();
+			if (delay >= 600.0)
+				delay_ms = 600000;
+			else if (delay > 0.0)
+				delay_ms = (long) (delay * 1000.0);
+			else
+				break;
+
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 delay_ms,
+							 WAIT_EVENT_PG_SLEEP);
+			ResetLatch(MyLatch);
+		}
+	}
+#endif
+
 	if (MyProcNumber < 0)
 		elog(ERROR, "MyProcNumber not set");
 	if (MyProcNumber >= NumProcSignalSlots)
@@ -614,6 +656,10 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
+					case PROCSIGNAL_BARRIER_SHMEM_RESIZE:
+						processed = ProcessBarrierShmemResize(
+								&ShmemCtrl->Barrier);
+						break;
 				}
 
 				/*
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 389abc82519..0fd421f004e 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -493,17 +493,26 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. Verify the structure's size:
+		 * - If it's the same, we've found the expected structure.
+		 * - If it's different, we're resizing the expected structure.
+		 *
+		 * XXX: There is an implicit assumption this can only happen in
+		 * "resizable" segments, where only one shared structure is allowed.
+		 * This has to be implemented more cleanly.
 		 */
 		if (result->size != size)
 		{
-			LWLockRelease(ShmemIndexLock);
-			ereport(ERROR,
-					(errmsg("ShmemIndex entry size is wrong for data structure"
-							" \"%s\": expected %zu, actual %zu",
-							name, size, result->size)));
+			Size delta = size - result->size;
+
+			result->size = size;
+
+			/* Reflect size change in the shared segment */
+			SpinLockAcquire(Segments[shmem_segment].ShmemLock);
+			Segments[shmem_segment].ShmemSegHdr->freeoffset += delta;
+			SpinLockRelease(Segments[shmem_segment].ShmemLock);
 		}
+
 		structPtr = result->location;
 	}
 	else
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b1fba850f02..58f1a05fd2a 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -62,6 +62,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -4311,6 +4312,15 @@ PostgresMain(const char *dbname, const char *username)
 	 */
 	BeginReportingGUCOptions();
 
+	/* Verify the shared barrier, if it's still active: join and wait. */
+	WaitOnShmemBarrier();
+
+	/*
+	 * After waiting on the barrier above we are guaranteed to have
+	 * NSharedBuffers broadcast, so we can use it in the function below.
+	 */
+	AdjustShmemSize();
+
 	/*
 	 * Also set up handler to log session end; we have to wait till now to be
 	 * sure Log_disconnections has its final value.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8bce14c38fd..e0ba8384fdd 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -155,6 +155,8 @@ REPLICATION_ORIGIN_DROP	"Waiting for a replication origin to become inactive so
 REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
+SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
@@ -351,6 +353,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index dede37f7905..1e70853ccdb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2354,14 +2354,14 @@ struct config_int ConfigureNamesInt[] =
 	 * checking for overflow, so we mustn't allow more than INT_MAX / 2.
 	 */
 	{
-		{"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+		{"shared_buffers", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the number of shared memory buffers used by the server."),
 			NULL,
 			GUC_UNIT_BLOCKS
 		},
 		&NBuffers,
 		16384, 16, INT_MAX / 2,
-		NULL, NULL, NULL
+		NULL, assign_shared_buffers, NULL
 	},
 
 	{
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0d8528b2875..405d0a7e65d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -173,6 +173,7 @@ extern PGDLLIMPORT char *DataDir;
 extern PGDLLIMPORT int data_directory_mode;
 
 extern PGDLLIMPORT int NBuffers;
+extern PGDLLIMPORT int MaxAvailableMemory;
 extern PGDLLIMPORT int MaxBackends;
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 1977001e533..52633dd7537 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -307,7 +307,7 @@ extern void LimitAdditionalLocalPins(uint32 *additional_pins);
 extern bool EvictUnpinnedBuffer(Buffer buf);
 
 /* in buf_init.c */
-extern void BufferManagerShmemInit(void);
+extern void BufferManagerShmemInit(int);
 extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 6ebda479ced..bb7ae4d33b3 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -64,6 +64,7 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
+extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
@@ -83,5 +84,7 @@ extern void CreateSharedMemoryAndSemaphores(void);
 extern void AttachSharedMemoryStructs(void);
 #endif
 extern void InitializeShmemGUCs(void);
+extern void CoordinateShmemResize(void);
+extern bool AnonymousShmemResize(void);
 
 #endif							/* IPC_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..558da6fdd55 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, ShmemResize)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index c5009a1cd73..2e47b222cbb 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,6 +24,7 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "storage/barrier.h"
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
@@ -56,6 +57,25 @@ typedef struct ShmemSegment
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
+/*
+ * ShmemControl is shared between backends and helps to coordinate shared
+ * memory resize.
+ */
+typedef struct
+{
+	pg_atomic_uint32 	NSharedBuffers;
+	Barrier 			Barrier;
+	pg_atomic_uint64 	Generation;
+	bool                Resizable;
+} ShmemControl;
+
+extern PGDLLIMPORT ShmemControl *ShmemCtrl;
+
+/* The phases for shared memory resizing, used by for ProcSignal barrier. */
+#define SHMEM_RESIZE_REQUESTED			0
+#define SHMEM_RESIZE_START				1
+#define SHMEM_RESIZE_DONE				2
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -107,6 +127,12 @@ extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 void *ReserveAnonymousMemory(Size reserve_size);
 
+bool ProcessBarrierShmemResize(Barrier *barrier);
+void assign_shared_buffers(int newval, void *extra, bool *pending);
+void AdjustShmemSize(void);
+extern void WaitOnShmemBarrier(void);
+extern void ShmemControlInit(void);
+
 /*
  * To be able to dynamically resize largest parts of the data stored in shared
  * memory, we split it into multiple shared memory mappings segments. Each
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 67fa9ac06e1..27bc6a81191 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -42,6 +42,7 @@ typedef enum
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
 	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
+	PMSIGNAL_SHMEM_RESIZE,	/* resize shared memory */
 } PMSignalReason;
 
 #define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index defd8b66a19..522b8de1e02 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,6 +54,7 @@ typedef enum
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
+	PROCSIGNAL_BARRIER_SHMEM_RESIZE, /* ask backends to resize shared memory */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1a30437ad96..6755b302858 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2738,6 +2738,7 @@ ShellTypeInfo
 ShippableCacheEntry
 ShippableCacheKey
 ShmemIndexEnt
+ShmemControl
 ShutdownForeignScan_function
 ShutdownInformation
 ShutdownMode
-- 
2.45.1

v4-0007-Use-anonymous-files-to-back-shared-memory-segment.patch
From 0e3c671082743f2826a7e8a96a19a071f5c8aeb3 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 15 Mar 2025 16:39:45 +0100
Subject: [PATCH v4 7/8] Use anonymous files to back shared memory segments

Allow the use of anonymous files for shared memory, instead of plain
anonymous memory. Such an anonymous file is created via memfd_create; it
lives in memory, behaves like a regular file, and is semantically
equivalent to anonymous memory allocated via mmap with MAP_ANONYMOUS.

The advantages of using anon files are the following:

* We've got a file descriptor, which could be used for regular file
  operations (modification, truncation, you name it).

* The file can be given a name, which improves readability when it
  comes to process maps. Here is how it looks:

7f90cde00000-7f90d5126000 rw-s 00000000 00:01 5463 /memfd:main (deleted)
7f90d5126000-7f914de00000 ---p 00000000 00:00 0
7f914de00000-7f9175128000 rw-s 00000000 00:01 5466 /memfd:buffers (deleted)
7f9175128000-7f944de00000 ---p 00000000 00:00 0
7f944de00000-7f9455528000 rw-s 00000000 00:01 5469 /memfd:descriptors (deleted)
7f9455528000-7f94cde00000 ---p 00000000 00:00 0
7f94cde00000-7f94d5228000 rw-s 00000000 00:01 5472 /memfd:iocv (deleted)
7f94d5228000-7f954de00000 ---p 00000000 00:00 0
7f954de00000-7f9555266000 rw-s 00000000 00:01 5475 /memfd:checkpoint (deleted)
7f9555266000-7f958de00000 ---p 00000000 00:00 0
7f958de00000-7f95954aa000 rw-s 00000000 00:01 5478 /memfd:strategy (deleted)
7f95954aa000-7f95cde00000 ---p 00000000 00:00 0

* By default, Linux will not add file-backed shared mappings into a core dump,
  making it more convenient to work with them in PostgreSQL: no more huge dumps
  to process.

The downside is that memfd_create is Linux specific.
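
As a minimal standalone illustration of the mechanism (not PostgreSQL code;
the helper name map_anon_file is made up), the file is created, sized, and
mapped like this:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Create an in-memory file of the given size and map it shared. */
    static void *
    map_anon_file(const char *name, size_t size)
    {
        int     fd = memfd_create(name, 0);     /* Linux-specific */

        if (fd == -1)
            return MAP_FAILED;
        if (ftruncate(fd, size) == -1)          /* give the file its size */
            return MAP_FAILED;

        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }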
---
 src/backend/port/sysv_shmem.c  | 73 +++++++++++++++++++++++++++++-----
 src/backend/port/win32_shmem.c |  2 +-
 src/backend/storage/ipc/ipci.c |  2 +-
 src/include/portability/mem.h  |  2 +-
 src/include/storage/pg_shmem.h |  3 +-
 5 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index a3437973784..87000a24eea 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -107,6 +107,7 @@ typedef struct AnonymousMapping
 	Pointer shmem; 				/* Pointer to the start of the mapped memory */
 	Pointer seg_addr; 			/* SysV shared memory for the header */
 	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -127,7 +128,7 @@ static int next_free_segment = 0;
  * 00400000-00490000         /path/bin/postgres
  * ...
  * 012d9000-0133e000         [heap]
- * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f443a800000-7f470a800000 /memfd:main (deleted)
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
@@ -150,9 +151,9 @@ static int next_free_segment = 0;
  * The result would look like this:
  *
  * 012d9000-0133e000         [heap]
- * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f4426f54000-7f442e010000 /memfd:main (deleted)
  * 7f442e010000-7f443a800000                     # reserved empty space
- * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f443a800000-7f444196c000 /memfd:buffers (deleted)
  * 7f444196c000-7f470a800000                     # reserved empty space
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
@@ -643,13 +644,14 @@ PGSharedMemoryAttach(IpcMemoryId shmId,
  * *hugepagesize and *mmap_flags are set to 0.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 #ifdef MAP_HUGETLB
 
 	Size		default_hugepagesize = 0;
 	Size		hugepagesize_local = 0;
 	int			mmap_flags_local = 0;
+	int			memfd_flags_local = 0;
 
 	/*
 	 * System-dependent code to find out the default huge page size.
@@ -708,6 +710,7 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	}
 
 	mmap_flags_local = MAP_HUGETLB;
+	memfd_flags_local = MFD_HUGETLB;
 
 	/*
 	 * On recent enough Linux, also include the explicit page size, if
@@ -718,7 +721,16 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	{
 		int			shift = pg_ceil_log2_64(hugepagesize_local);
 
-		mmap_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+	}
+#endif
+
+#if defined(MFD_HUGE_MASK) && defined(MFD_HUGE_SHIFT)
+	if (hugepagesize_local != default_hugepagesize)
+	{
+		int			shift = pg_ceil_log2_64(hugepagesize_local);
+
+		memfd_flags_local |= (shift & MFD_HUGE_MASK) << MFD_HUGE_SHIFT;
 	}
 #endif
 
@@ -727,6 +739,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*mmap_flags = mmap_flags_local;
 	if (hugepagesize)
 		*hugepagesize = hugepagesize_local;
+	if (memfd_flags)
+		*memfd_flags = memfd_flags_local;
 
 #else
 
@@ -734,6 +748,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*hugepagesize = 0;
 	if (mmap_flags)
 		*mmap_flags = 0;
+	if (memfd_flags)
+		*memfd_flags = 0;
 
 #endif							/* MAP_HUGETLB */
 }
@@ -771,7 +787,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
-	int			mmap_flags = PG_MMAP_FLAGS;
+	int			mmap_flags = PG_MMAP_FLAGS, memfd_flags = 0;
 
 #ifndef MAP_HUGETLB
 	/* ReserveAnonymousMemory should have dealt with this case */
@@ -785,7 +801,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
 
 		/* Round up the request size to a suitable large value */
-		GetHugePageSize(&hugepagesize, &mmap_flags);
+		GetHugePageSize(&hugepagesize, &mmap_flags, &memfd_flags);
 
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
@@ -794,6 +810,29 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	}
 #endif
 
+	/*
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped,  it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, thus it has
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
+	 */
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_segment),
+									   memfd_flags);
+
+	/*
+	 * Specify the segment file size using allocsize, which contains
+	 * potentially modified size.
+	 */
+	if(ftruncate(mapping->segment_fd, allocsize) == -1)
+		ereport(FATAL,
+				(errcode(ERRCODE_SYSTEM_ERROR),
+				 errmsg("could not truncate anonymous file for \"%s\": %m",
+						MappingName(mapping->shmem_segment))));
+
 	elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p",
 		 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
 
@@ -807,7 +846,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	 * a restart.
 	 */
 	ptr = mmap(base + reserved_offset, allocsize, PROT_READ | PROT_WRITE,
-			   mmap_flags | MAP_FIXED, -1, 0);
+			   mmap_flags | MAP_FIXED, mapping->segment_fd, 0);
 	mmap_errno = errno;
 
 	if (ptr == MAP_FAILED)
@@ -817,8 +856,15 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 					 "fallback to the non-resizable allocation",
 			 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
 
+		/* Specify the segment file size using allocsize. */
+		if(ftruncate(mapping->segment_fd, allocsize) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not truncate anonymous file for \"%s\": %m",
+							MappingName(mapping->shmem_segment))));
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-						   PG_MMAP_FLAGS, -1, 0);
+						   PG_MMAP_FLAGS, mapping->segment_fd, 0);
 		mmap_errno = errno;
 	}
 	else
@@ -889,7 +935,7 @@ ReserveAnonymousMemory(Size reserve_size)
 		Size		hugepagesize, total_size = 0;
 		int			mmap_flags;
 
-		GetHugePageSize(&hugepagesize, &mmap_flags);
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
 
 		/*
 		 * Figure out how much memory is needed for all segments, keeping in
@@ -1070,6 +1116,13 @@ AnonymousShmemResize(void)
 		if (m->shmem_size == new_size)
 			continue;
 
+		/* Resize the backing anon file. */
+		if(ftruncate(m->segment_fd, new_size) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not truncate anonymous file for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
 		/* Clean up some reserved space to resize into */
 		if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
 			ereport(FATAL,
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index ce719f1b412..ba972106de1 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -627,7 +627,7 @@ pgwin32_ReserveSharedMemoryRegion(HANDLE hChild)
  * use GetLargePageMinimum() instead.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 	if (hugepagesize)
 		*hugepagesize = 0;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index abeb91e24fd..dc2b4becf4a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -396,7 +396,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the number of huge pages required.
 	 */
-	GetHugePageSize(&hp_size, NULL);
+	GetHugePageSize(&hp_size, NULL, NULL);
 	if (hp_size != 0)
 	{
 		Size		hp_required;
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index ef9800732d9..40588ff6968 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 2e47b222cbb..b9573520d9a 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -124,7 +124,8 @@ extern PGShmemHeader *PGSharedMemoryCreate(Size size,
 										   PGShmemHeader **shim, Pointer base);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
-extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
+extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
+							int *memfd_flags);
 void *ReserveAnonymousMemory(Size reserve_size);
 
 bool ProcessBarrierShmemResize(Barrier *barrier);
-- 
2.45.1

v4-0008-Support-resize-for-hugetlb.patchtext/plain; charset=us-asciiDownload
From 08476af71724fcb3035fc907dc98a6ff351fe58e Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 5 Apr 2025 19:51:33 +0200
Subject: [PATCH v4 8/8] Support resize for hugetlb

The Linux kernel has a set of limitations on remapping hugetlb segments: it
can't increase the size of such a segment [1], and shrinking it will not
release the memory back. In fact, support for hugetlb mremap was
implemented not so long ago [2].

As a workaround, avoid mremap for resizing shared memory. Instead, unmap
the whole segment and map it back at the same address with the new size,
relying on the fact that the fd for the anon file behind the segment is
still open and keeps the memory contents.

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mremap.c?id=f4d2ef48250ad057e4f00087967b5ff366da9f39#n1593
[2]: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/mremap.c?id=550a7d60bd5e35a56942dba6d8a26752beb26c9f
---
 src/backend/port/sysv_shmem.c | 60 +++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 87000a24eea..f0b53ce1d7c 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1109,6 +1109,7 @@ AnonymousShmemResize(void)
 		/* Note that CalculateShmemSize indirectly depends on NBuffers */
 		Size new_size = CalculateShmemSize(&numSemas, i);
 		AnonymousMapping *m = &Mappings[i];
+		int	mmap_flags = PG_MMAP_FLAGS;
 
 		if (m->shmem == NULL)
 			continue;
@@ -1116,6 +1117,44 @@ AnonymousShmemResize(void)
 		if (m->shmem_size == new_size)
 			continue;
 
+#ifndef MAP_HUGETLB
+		/* ReserveAnonymousMemory should have dealt with this case */
+		Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
+#else
+		if (huge_pages_on)
+		{
+			Size		hugepagesize;
+
+			/* Make sure nothing is messed up */
+			Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+			/* Round up the new size to a suitable large value */
+			GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+
+			if (new_size % hugepagesize != 0)
+				new_size += hugepagesize - (new_size % hugepagesize);
+
+			mmap_flags = PG_MMAP_FLAGS | mmap_flags;
+		}
+#endif
+
+		/*
+		 * Linux limitations do not allow us to mremap hugetlb in the way we
+		 * want: no size increase is allowed, and shrinking will not release
+		 * the memory back. To work around this, unmap the segment and create
+		 * a new one at the same address. Thanks to the backing anon file the
+		 * content will still be kept in memory.
+		 */
+		elog(DEBUG1, "segment[%s]: remap from %zu to %zu at address %p",
+					 MappingName(m->shmem_segment), m->shmem_size,
+					 new_size, m->shmem);
+
+		if (munmap(m->shmem, m->shmem_size) < 0)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not unmap shared memory segment %s [%p]: %m",
+							MappingName(m->shmem_segment), m->shmem)));
+
 		/* Resize the backing anon file. */
 		if(ftruncate(m->segment_fd, new_size) == -1)
 			ereport(FATAL,
@@ -1123,25 +1162,14 @@ AnonymousShmemResize(void)
 					 errmsg("could not truncase anonymous file for \"%s\": %m",
 							MappingName(m->shmem_segment))));
 
-		/* Clean up some reserved space to resize into */
-		if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
-			ereport(FATAL,
-					(errcode(ERRCODE_SYSTEM_ERROR),
-					 errmsg("could not unmap %zu from reserved shared memory %p: %m",
-							new_size - m->shmem_size, m->shmem)));
-
-		/* Claim the unused space */
-		elog(DEBUG1, "segment[%s]: remap from %zu to %zu at address %p",
-					 MappingName(m->shmem_segment), m->shmem_size,
-					 new_size, m->shmem);
-
-		ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
+		/* Reclaim the space */
+		ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
+				   mmap_flags | MAP_FIXED, m->segment_fd, 0);
 		if (ptr == MAP_FAILED)
 			ereport(FATAL,
 					(errcode(ERRCODE_SYSTEM_ERROR),
-					 errmsg("could not resize shared memory segment %s [%p] to %d (%zu): %m",
-							MappingName(m->shmem_segment), m->shmem, NBuffers,
-							new_size)));
+					 errmsg("could not map shared memory segment %s [%p] with size %zu: %m",
+							MappingName(m->shmem_segment), m->shmem, new_size)));
 
 		reinit = true;
 		m->shmem_size = new_size;
-- 
2.45.1

#44Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#43)
Re: Changing shared_buffers without restart

On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

In the new v4 version
of the patch the first option is implemented.

The patches don't apply cleanly using git am, but patch -p1 applies
them cleanly. However, I see the following compilation errors:

RuntimeError: command "ninja" failed with error
[1/1954] Generating src/include/utils/errcodes with a custom command
[2/1954] Generating src/include/storage/lwlocknames_h with a custom command
[3/1954] Generating src/include/utils/wait_event_names with a custom command
[4/1954] Compiling C object src/port/libpgport.a.p/pg_popcount_aarch64.c.o
[5/1954] Compiling C object src/port/libpgport.a.p/pg_numa.c.o
FAILED: src/port/libpgport.a.p/pg_numa.c.o
cc -Isrc/port/libpgport.a.p -Isrc/include
-I../../coderoot/pg/src/include -fdiagnostics-color=always
-D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Werror -g
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE
-Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3
-Wcast-function-type -Wshadow=compatible-local -Wformat-security
-Wdeclaration-after-statement -Wno-format-truncation
-Wno-stringop-truncation -fPIC -DFRONTEND -MD -MQ
src/port/libpgport.a.p/pg_numa.c.o -MF
src/port/libpgport.a.p/pg_numa.c.o.d -o
src/port/libpgport.a.p/pg_numa.c.o -c
../../coderoot/pg/src/port/pg_numa.c
In file included from ../../coderoot/pg/src/include/storage/spin.h:54,
from
../../coderoot/pg/src/include/storage/condition_variable.h:26,
from ../../coderoot/pg/src/include/storage/barrier.h:22,
from ../../coderoot/pg/src/include/storage/pg_shmem.h:27,
from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
"s_lock.h may not be included from frontend code"
93 | #error "s_lock.h may not be included from frontend code"
| ^~~~~
In file included from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/pg_shmem.h:66:9: error: unknown
type name ‘pg_atomic_uint32’
66 | pg_atomic_uint32 NSharedBuffers;
| ^~~~~~~~~~~~~~~~
../../coderoot/pg/src/include/storage/pg_shmem.h:68:9: error: unknown
type name ‘pg_atomic_uint64’
68 | pg_atomic_uint64 Generation;
| ^~~~~~~~~~~~~~~~
../../coderoot/pg/src/port/pg_numa.c: In function ‘pg_numa_get_pagesize’:
../../coderoot/pg/src/port/pg_numa.c:117:17: error: too few arguments
to function ‘GetHugePageSize’
117 | GetHugePageSize(&os_page_size, NULL);
| ^~~~~~~~~~~~~~~
In file included from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/pg_shmem.h:127:13: note: declared here
127 | extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
| ^~~~~~~~~~~~~~~

--
Best Wishes,
Ashutosh Bapat

#45Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#44)
Re: Changing shared_buffers without restart

On Wed, Apr 09, 2025 at 11:12:18AM GMT, Ashutosh Bapat wrote:
On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

In the new v4 version
of the patch the first option is implemented.

The patches don't apply cleanly using git am but patch -p1 applies
them cleanly. However I see following compilation errors
RuntimeError: command "ninja" failed with error

Because it's relatively meaningless to apply a patch to the tip of
master around release freeze time :) Commit 65c298f61fc has
introduced a new usage of GetHugePageSize, which was modified in my patch.
I'm going to address it with the next rebased version, in the meantime
you can always use the specified base commit to apply the changeset:

base-commit: 5e1915439085014140314979c4dd5e23bd677cac

#46Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#45)
Re: Changing shared_buffers without restart

On Wed, Apr 9, 2025 at 1:15 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Wed, Apr 09, 2025 at 11:12:18AM GMT, Ashutosh Bapat wrote:
On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

In the new v4 version
of the patch the first option is implemented.

The patches don't apply cleanly using git am but patch -p1 applies
them cleanly. However I see following compilation errors
RuntimeError: command "ninja" failed with error

Becase it's relatively meaningless to apply a patch to the tip of the
master around the release freeze time :) Commit 65c298f61fc has
introduced new usage of GetHugePageSize, which was modified in my patch.
I'm going to address it with the next rebased version, in the meantime
you can always use the specified base commit to apply the changeset:

base-commit: 5e1915439085014140314979c4dd5e23bd677cac

There is a higher chance that people will try these patches now than
there was two days ago, and a better chance still if they find the
patches easy to apply.

../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
"s_lock.h may not be included from frontend code"

How about this? Why is that happening?

--
Best Wishes,
Ashutosh Bapat

#47Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#46)
Re: Changing shared_buffers without restart

On Wed, Apr 09, 2025 at 01:20:16PM GMT, Ashutosh Bapat wrote:
../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
"s_lock.h may not be included from frontend code"

How about this? Why is that happening?

The same -- as you can see, it comes from compiling pg_numa.c, which it
seems is used in frontend code and imports pg_shmem.h. I wanted to
reshuffle the includes in the patch anyway; that would be a good excuse
to finally do this.

#48Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#43)
Re: Changing shared_buffers without restart

On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Yes, you're right, plain dynamic Barrier does not ensure all available
processes will be synchronized. I was aware about the scenario you
describe, it's mentioned in commentaries for the resize function. I was
under the impression this should be enough, but after some more thinking
I'm not so sure anymore. Let me try to structure it as a list of
possible corner cases that we need to worry about:

* New backend spawned while we're busy resizing shared memory. Those
should wait until the resizing is complete and get the new size as well.

* Old backend receives a resize message, but exits before attempting to
resize. Those should be excluded from coordination.

Should we detach barrier in on_exit()?

* A backend is blocked and not responding before or after the
ProcSignalBarrier message was sent. I'm thinking about a failure
situation, when one rogue backend is doing something without checking
for interrupts. We need to wait for those to become responsive, and
potentially abort shared memory resize after some timeout.

Right.

I think a relatively elegant solution is to extend ProcSignalBarrier
mechanism to track not only pss_barrierGeneration, as a sign that
everything was processed, but also something like
pss_barrierReceivedGeneration, indicating that the message was received
everywhere but not processed yet. That would be enough to allow
processes to wait until the resize message was received everywhere, then
use a global Barrier to wait until all processes are finished. It's
somehow similar to your proposal to use two signals, but has less
implementation overhead.

The way it's implemented in v4 still has the disjoint group problem.
Assume backends p1, p2, p3. All three of them are executing
ProcessProcSignalBarrier(). All three of them updated
pss_barrierReceivedGeneration

/* The message is observed, record that */
pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
shared_gen);

p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));

Since all the processes have received the barrier message, p1, p2 move
ahead and go through all the next phases and finish resizing even
before p3 gets a chance to call ProcessBarrierShmemResize() and attach
itself to Barrier. This could happen because it processed some other
ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
not attached itself to the barrier. Once p1, p2 finish, p3 will attach
itself to the barrier and resize buffers again - reinitializing the
shared memory, which might have already been modified by p1 or p2. Boom
- there's memory corruption.

Either every process has to make sure that all the other extant
backends have attached themselves to the barrier OR somebody has to
ensure that and signal all the backends to proceed. The implementation
doesn't do either.

* Shared memory address space is now reserved for future usage, making
shared memory segments clash (e.g. due to memory allocation)
impossible. There is a new GUC to control how much space to reserve,
which is called max_available_memory -- on the assumption that most of
the time it would make sense to set its value to the total amount of
memory on the machine. I'm open for suggestions regarding the name.

With 0006 applied
+ /* Clean up some reserved space to resize into */
+ if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
ze, m->shmem)));
... snip ...
+ ptr = mremap(m->shmem, m->shmem_size, new_size, 0);

We unmap the portion of reserved address space that the existing
segment would expand into. As long as we are just expanding, this will
work. I am wondering how this would work for shrinking buffers? What
scheme do you have in mind?

* There is one more patch to address hugepages remap. As mentioned in
this thread above, Linux kernel has certain limitations when it comes
to mremap for segments allocated with huge pages. To work around it's
possible to replace mremap with a sequence of unmap and map again,
relying on the anon file behind the segment to keep the memory
content. I haven't found any downsides of this approach so far, but it
makes the anonymous file patch 0007 mandatory.

In 0008
if (munmap(m->shmem, m->shmem_size) < 0)
... snip ...
/* Resize the backing anon file. */
if(ftruncate(m->segment_fd, new_size) == -1)
...
/* Reclaim the space */
ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
mmap_flags | MAP_FIXED, m->segment_fd, 0);

How are we preventing something from getting mapped into the space after
m->shmem + new_size? We will need to add an unallocated but reserved
address space mapping after m->shmem + new_size, right?

--
Best Wishes,
Ashutosh Bapat

#49Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#48)
Re: Changing shared_buffers without restart

On Fri, Apr 11, 2025 at 08:04:39PM GMT, Ashutosh Bapat wrote:
On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Yes, you're right, plain dynamic Barrier does not ensure all available
processes will be synchronized. I was aware about the scenario you
describe, it's mentioned in commentaries for the resize function. I was
under the impression this should be enough, but after some more thinking
I'm not so sure anymore. Let me try to structure it as a list of
possible corner cases that we need to worry about:

* New backend spawned while we're busy resizing shared memory. Those
should wait until the resizing is complete and get the new size as well.

* Old backend receives a resize message, but exits before attempting to
resize. Those should be excluded from coordination.

Should we detach barrier in on_exit()?

Yeah, good point.

I think a relatively elegant solution is to extend ProcSignalBarrier
mechanism to track not only pss_barrierGeneration, as a sign that
everything was processed, but also something like
pss_barrierReceivedGeneration, indicating that the message was received
everywhere but not processed yet. That would be enough to allow
processes to wait until the resize message was received everywhere, then
use a global Barrier to wait until all processes are finished. It's
somehow similar to your proposal to use two signals, but has less
implementation overhead.

The way it's implemented in v4 still has the disjoint group problem.
Assume backends p1, p2, p3. All three of them are executing
ProcessProcSignalBarrier(). All three of them updated
pss_barrierReceivedGeneration

/* The message is observed, record that */
pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
shared_gen);

p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));

Since all the processes have received the barrier message, p1, p2 move
ahead and go through all the next phases and finish resizing even
before p3 gets a chance to call ProcessBarrierShmemResize() and attach
itself to Barrier. This could happen because it processed some other
ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
not attached itself to the barrier. Once p1, p2 finish, p3 will attach
itself to the barrier and resize buffers again - reinitializing the
shared memory, which might has been already modified by p1 or p2. Boom
- there's memory corruption.

It won't reinitialize anything, since this logic is controlled by
ShmemCtrl->NSharedBuffers; if it's already updated, nothing will be
changed.

About the race condition you mention, there is indeed a window between
receiving the ProcSignalBarrier and attaching to the global Barrier in
resize, but I don't think any process will be able to touch the buffer
pool while inside this window. Even if the remapping itself was so
blazing fast that this window was enough to make one process late
(e.g. if it was busy handling some other signal, as you mention), as I've
shown above it shouldn't be a problem.

I can experiment with this case though; maybe there is a way to
completely close this window, so we don't have to think about even
potential scenarios.

* Shared memory address space is now reserved for future usage, making
shared memory segments clash (e.g. due to memory allocation)
impossible. There is a new GUC to control how much space to reserve,
which is called max_available_memory -- on the assumption that most of
the time it would make sense to set its value to the total amount of
memory on the machine. I'm open for suggestions regarding the name.

With 0006 applied
+ /* Clean up some reserved space to resize into */
+ if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
ze, m->shmem)));
... snip ...
+ ptr = mremap(m->shmem, m->shmem_size, new_size, 0);

We unmap the portion of reserved address space where the existing
segment would expand into. As long as we are just expanding this will
work. I am wondering how would this work for shrinking buffers? What
scheme do you have in mind?

I didn't like this part originally, and after the changes to support hugetlb
I think it's worth it just to replace mremap with munmap/mmap. That way
there will be no such question, e.g. if a segment is getting shrunk
the unmapped area will again become a part of the reserved space.

* There is one more patch to address hugepages remap. As mentioned in
this thread above, Linux kernel has certain limitations when it comes
to mremap for segments allocated with huge pages. To work around it's
possible to replace mremap with a sequence of unmap and map again,
relying on the anon file behind the segment to keep the memory
content. I haven't found any downsides of this approach so far, but it
makes the anonymous file patch 0007 mandatory.

In 0008
if (munmap(m->shmem, m->shmem_size) < 0)
... snip ...
/* Resize the backing anon file. */
if(ftruncate(m->segment_fd, new_size) == -1)
...
/* Reclaim the space */
ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
mmap_flags | MAP_FIXED, m->segment_fd, 0);

How are we preventing something get mapped into the space after
m->shmem + newsize? We will need to add an unallocated but reserved
addressed space map after m->shmem+newsize right?

Nope, the segment is allocated from the reserved space already, with
some chunk of it left after the segment's end for resizing purposes. We
only take some part of the designated space, the rest is still reserved.

#50Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#49)
Re: Changing shared_buffers without restart

On Fri, Apr 11, 2025 at 8:31 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I think a relatively elegant solution is to extend ProcSignalBarrier
mechanism to track not only pss_barrierGeneration, as a sign that
everything was processed, but also something like
pss_barrierReceivedGeneration, indicating that the message was received
everywhere but not processed yet. That would be enough to allow
processes to wait until the resize message was received everywhere, then
use a global Barrier to wait until all processes are finished. It's
somehow similar to your proposal to use two signals, but has less
implementation overhead.

The way it's implemented in v4 still has the disjoint group problem.
Assume backends p1, p2, p3. All three of them are executing
ProcessProcSignalBarrier(). All three of them updated
pss_barrierReceivedGeneration

/* The message is observed, record that */
pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
shared_gen);

p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));

Since all the processes have received the barrier message, p1, p2 move
ahead and go through all the next phases and finish resizing even
before p3 gets a chance to call ProcessBarrierShmemResize() and attach
itself to Barrier. This could happen because it processed some other
ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
not attached itself to the barrier. Once p1, p2 finish, p3 will attach
itself to the barrier and resize buffers again - reinitializing the
shared memory, which might has been already modified by p1 or p2. Boom
- there's memory corruption.

It won't reinitialize anything, since this logic is controlled by the
ShmemCtrl->NSharedBuffers, if it's already updated nothing will be
changed.

Ah, I see it now
if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
{

Thanks for the clarification.

However, when we put back the patches to shrink buffers, we will evict
the extra buffers, and shrink - if all the processes haven't
participated in the barrier by then, some of them may try to access
those buffers - re-installing them and then bad things can happen.

About the race condition you mention, there is indeed a window between
receiving the ProcSignalBarrier and attaching to the global Barrier in
resize, but I don't think any process will be able to touch buffer pool
while inside this window. Even if it happens that the remapping itself
was blazing fast that this window was enough to make one process late
(e.g. if it was busy handling some other signal as you mention), as I've
showed above it shouldn't be a problem.

I can experiment with this case though, maybe there is a way to
completely close this window to not thing about even potential
scenarios.

The window may be small today but we have to make this future proof.
Multiple ProcSignalBarrier messages may be processed in a single call
to ProcessProcSignalBarrier() and if each of those takes as long as
buffer resizing, the window will get bigger and bigger. So we have to
close this window.

* Shared memory address space is now reserved for future usage, making
shared memory segments clash (e.g. due to memory allocation)
impossible. There is a new GUC to control how much space to reserve,
which is called max_available_memory -- on the assumption that most of
the time it would make sense to set its value to the total amount of
memory on the machine. I'm open for suggestions regarding the name.

With 0006 applied
+ /* Clean up some reserved space to resize into */
+ if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
ze, m->shmem)));
... snip ...
+ ptr = mremap(m->shmem, m->shmem_size, new_size, 0);

We unmap the portion of reserved address space where the existing
segment would expand into. As long as we are just expanding this will
work. I am wondering how would this work for shrinking buffers? What
scheme do you have in mind?

I didn't like this part originally, and after changes to support hugetlb
I think it's worth it just to replace mremap with munmap/mmap. That way
there will be no such question, e.g. if a segment is getting shrinked
the unmapped area will again become a part of the reserved space.

I might not have noticed it, but are we putting two mappings, one
reserved and one allocated, in the same address space, so that when the
allocated mapping shrinks or expands, the reserved mapping continues
to prohibit any other mapping from appearing there? I looked at some
of the previous emails, but didn't find anything that describes how
the reserved mapped space is managed.

--
Best Wishes,
Ashutosh Bapat

#51Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#50)
Re: Changing shared_buffers without restart

On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:

However, when we put back the patches to shrink buffers, we will evict
the extra buffers, and shrink - if all the processes haven't
participated in the barrier by then, some of them may try to access
those buffers - re-installing them and then bad things can happen.

As I've mentioned above, I don't see how a process could try to access a
buffer if it's on the path between receiving the ProcSignalBarrier and
attaching to the global shmem Barrier, even if we shrink buffers.
AFAICT interrupt handlers should not touch buffers, and otherwise the
process doesn't have any point within this window where it might do
this. Do you have some particular scenario in mind?

I might have not noticed it, but are we putting two mappings one
reserved and one allocated in the same address space, so that when the
allocated mapping shrinks or expands, the reserved mapping continues
to prohibit any other mapping from appearing there? I looked at some
of the previous emails, but didn't find anything that describes how
the reserved mapped space is managed.

I thought so, but this turns out to be incorrect. I've just done a small
experiment -- it looks like when reserving some space, mapping and then
unmapping a small segment from it leaves a non-mapped gap. That would
mean that for shrinking, the newly available space has to be reserved again.
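
Roughly, the experiment looks like this (a minimal standalone sketch for
illustration, Linux-specific, error handling mostly omitted):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define RESERVE (1024UL * 1024 * 1024)	/* reserved address space, 1 GB */
#define SEGMENT (16UL * 1024 * 1024)	/* segment carved out of it, 16 MB */

static void
dump_maps(const char *tag)
{
	printf("=== %s ===\n", tag);
	fflush(stdout);
	system("cat /proc/self/maps");
}

int
main(void)
{
	/* Reserve address space only: no backing memory, not accessible */
	char *base = mmap(NULL, RESERVE, PROT_NONE,
					  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (base == MAP_FAILED)
		return 1;
	dump_maps("after reservation");

	/* Carve an accessible segment out of the reservation */
	if (mmap(base, SEGMENT, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
		return 1;
	dump_maps("after mapping a segment");

	/*
	 * Unmapping the segment leaves a plain hole: the range is neither the
	 * segment nor part of the PROT_NONE reservation anymore.
	 */
	if (munmap(base, SEGMENT) == -1)
		return 1;
	dump_maps("after unmapping the segment");

	return 0;
}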

#52Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#51)
Re: Changing shared_buffers without restart

On Mon, Apr 14, 2025 at 12:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:

However, when we put back the patches to shrink buffers, we will evict
the extra buffers, and shrink - if all the processes haven't
participated in the barrier by then, some of them may try to access
those buffers - re-installing them and then bad things can happen.

As I've mentioned above, I don't see how a process could try to access a
buffer, if it's on the path between receiving the ProcSignalBarrier and
attaching to the global shmem Barrier, even if we shrink buffers.
AFAICT interrupt handles should not touch buffers, and otherwise the
process doesn't have any point withing this window where it might do
this. Do you have some particular scenario in mind?

ProcessProcSignalBarrier() is not within an interrupt handler but it
responds to a flag set by an interrupt handler. After calling
pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
shared_gen); it will enter the loop

while (flags != 0)
where it may process many barriers before processing
PROCSIGNAL_BARRIER_SHMEM_RESIZE. Nothing stops the other barrier
processing code from touching buffers. Right now it's just smgrrelease
that gets called in the other barrier, but that's not guaranteed in
the future.

I might have not noticed it, but are we putting two mappings one
reserved and one allocated in the same address space, so that when the
allocated mapping shrinks or expands, the reserved mapping continues
to prohibit any other mapping from appearing there? I looked at some
of the previous emails, but didn't find anything that describes how
the reserved mapped space is managed.

I though so, but this turns out to be incorrect. Just have done a small
experiment -- looks like when reserving some space, mapping and
unmapping a small segment from it leaves a non-mapped gap. That would
mean for shrinking the new available space has to be reserved again.

Right. That's what I thought, but I didn't see the corresponding code.
So we have to keep track of two mappings for every segment - one for
allocation and one for reserving space - and resize those two while
shrinking and expanding buffers. Am I correct?

--
Best Wishes,
Ashutosh Bapat

#53Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#51)
Re: Changing shared_buffers without restart

Hi Dmitry,

On Mon, Apr 14, 2025 at 12:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:

However, when we put back the patches to shrink buffers, we will evict
the extra buffers, and shrink - if all the processes haven't
participated in the barrier by then, some of them may try to access
those buffers - re-installing them and then bad things can happen.

As I've mentioned above, I don't see how a process could try to access a
buffer, if it's on the path between receiving the ProcSignalBarrier and
attaching to the global shmem Barrier, even if we shrink buffers.
AFAICT interrupt handles should not touch buffers, and otherwise the
process doesn't have any point withing this window where it might do
this. Do you have some particular scenario in mind?

I might have not noticed it, but are we putting two mappings one
reserved and one allocated in the same address space, so that when the
allocated mapping shrinks or expands, the reserved mapping continues
to prohibit any other mapping from appearing there? I looked at some
of the previous emails, but didn't find anything that describes how
the reserved mapped space is managed.

I though so, but this turns out to be incorrect. Just have done a small
experiment -- looks like when reserving some space, mapping and
unmapping a small segment from it leaves a non-mapped gap. That would
mean for shrinking the new available space has to be reserved again.

In an offlist chat Thomas Munro mentioned that just ftruncate() would
be enough to resize the shared memory without touching address maps
using mmap and munmap().

The ftruncate man page [1] seems to concur with him:

If the effect of ftruncate() is to decrease the size of a memory
mapped file or a shared memory object and whole pages beyond the
new end were previously mapped, then the whole pages beyond the
new end shall be discarded.

References to discarded pages shall result in the generation of a
SIGBUS signal.

If the effect of ftruncate() is to increase the size of a memory
object, it is unspecified whether the contents of any mapped pages
between the old end-of-file and the new are flushed to the
underlying object.

When shrinking memory, ftruncate() will release the extra pages and
will also cause a fault (SIGBUS) when memory beyond the size of the
file is accessed, even if the actual address map is larger than the
mapped file. When expanding, memory is allocated as it is written to,
and those pages also become visible in the underlying object.

I played with the attached small program under a debugger, observing pmap
and /proc/<pid>/status after every memory operation. The address map
always shows the full 300MB mapping:
00007fffd2200000 307200K rw-s- memfd:mmap_fd_exp (deleted)

Immediately after mmap()
RssShmem: 0 kB

after first memset
RssShmem: 307200 kB

after ftruncate to 100MB (we don't need to wait for memset() to see
the effect on RssShmem)
RssShmem: 102400 kB

after ftruncate to 200MB (requires memset to see effect on RssShmem)
RssShmem: 102400 kB

after memsetting upto 200MB
RssShmem: 204800 kB

All the observations concur with the man page.
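
For reference, a minimal standalone version of that kind of program (a
sketch for illustration, assuming memfd_create; the actual attachment
may differ in details):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define MB (1024UL * 1024)

static void
pause_here(const char *tag)
{
	printf("%s -- check RssShmem in /proc/%d/status, press Enter\n",
		   tag, (int) getpid());
	getchar();
}

int
main(void)
{
	size_t		size = 300 * MB;
	int			fd = memfd_create("mmap_fd_exp", 0);
	char	   *p;

	if (fd == -1 || ftruncate(fd, size) == -1)
		return 1;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	pause_here("after mmap");				/* RssShmem: 0 kB */

	memset(p, 1, size);
	pause_here("after memset 300MB");		/* RssShmem: ~307200 kB */

	ftruncate(fd, 100 * MB);				/* pages beyond 100MB discarded */
	pause_here("after ftruncate 100MB");	/* RssShmem: ~102400 kB */

	ftruncate(fd, 200 * MB);				/* grows on paper, allocated on write */
	pause_here("after ftruncate 200MB");	/* RssShmem: still ~102400 kB */

	memset(p, 1, 200 * MB);					/* touching beyond 200MB would SIGBUS */
	pause_here("after memset 200MB");		/* RssShmem: ~204800 kB */

	return 0;
}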

[1]: https://man7.org/linux/man-pages/man3/ftruncate.3p.html#:~:text=If%20the%20effect%20of%20ftruncate,generation%20of%20a%20SIGBUS%20signal.

--
Best Wishes,
Ashutosh Bapat

#54Konstantin Knizhnik
knizhnik@garret.ru
In reply to: Dmitry Dolgov (#33)
Re: Changing shared_buffers without restart

On 25/02/2025 11:52 am, Dmitry Dolgov wrote:

On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.

Hi Dmitry,

I am sorry that I have not participated in the discussion in this thread
from the very beginning, although I am also very interested in dynamic
shared buffer resizing and even proposed my own implementation of it:
https://github.com/knizhnik/postgres/pull/2 based on memory ballooning
and using `madvise`. And it really works (returns unused memory to the
system).
This PoC allowed me to understand the main drawbacks of this approach:

1. Performance of the Postgres CLOCK page eviction algorithm depends on
the number of shared buffers. My first naive attempt, just marking unused
buffers as invalid, caused a significant performance degradation:

pgbench -c 32 -j 4 -T 100 -P1 -M prepared -S

(here shared_buffers is the maximal shared buffers size and
`available_buffers` is the used part):

| shared_buffers | available_buffers | TPS  |
| -------------- | ----------------- | ---- |
| 128MB          | -1                | 280k |
| 1GB            | -1                | 324k |
| 2GB            | -1                | 358k |
| 32GB           | -1                | 350k |
| 2GB            | 128Mb             | 130k |
| 2GB            | 1Gb               | 311k |
| 32GB           | 128Mb             | 13k  |
| 32GB           | 1Gb               | 140k |
| 32GB           | 2Gb               | 348k |

My first thought was to replace the clock with an LRU based on a doubly-linked
list. As there is no lockless doubly-linked list implementation,
it needs some global lock. This lock can become a bottleneck. The standard
solution is partitioning: use N LRU lists instead of 1,
just like the partitioned hash table used by the buffer manager to look up
buffers. Actually we can use the same partition locks to protect the LRU
lists. But it is not clear what to do with ring buffers (strategies). So I
decided not to perform such a revolution in bufmgr, but to optimize the
clock to skip reserved buffers more efficiently.
Just add a skip_count field to the buffer descriptor. And it helps! Now the
worst case shared_buffers/available_buffers = 32GB/128MB
shows the same performance (280k) as shared_buffers=128MB without ballooning.
2. There are several data structures in Postgres whose size depends on
the number of buffers.
In my patch I used a dynamic shared buffer size in some cases, but if such
a structure has to be allocated in shared memory, then the maximal size
still has to be used. We have the buffers themselves (8 kB per buffer),
then the main BufferDescriptors array (64 B), the BufferIOCVArray (16 B),
checkpoint's CkptBufferIds (20 B), and the hashmap on the buffer cache
(24 B + 8 B/entry).
128 bytes per 8 kB buffer does not seem a large overhead (~1.5%), but it
may be quite noticeable with size differences larger than 2 orders of
magnitude: e.g. to support scaling from 0.5 GB to 128 GB, with 128
bytes/buffer we'd have ~2 GiB of static overhead on only 0.5 GiB of actual
buffers.

3. `madvise` is not portable.

Certainly you have moved much further in your proposal compared with my
PoC (including huge pages support).
But it is still not quite clear to me how you are going to solve the
problem of large memory overhead in case of ~100x variation of the
shared buffers size.


#55Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Eisentraut (#9)
Re: Changing shared_buffers without restart

On Thu, Nov 21, 2024 at 8:55 PM Peter Eisentraut <peter@eisentraut.org> wrote:

On 19.11.24 14:29, Dmitry Dolgov wrote:

I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how
that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify
both of them if you want huge pages?

Correct, both (one flag in memfd_create and one for mmap) are needed to
use huge pages.

I was worried because the FreeBSD man page says

MFD_HUGETLB This flag is currently unsupported.

It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.

But you should make sure in your patch that the right set of flags for
huge pages is passed.

MFD_HUGETLB does actually work on FreeBSD, but the man page doesn't
admit it (guessing an oversight, not sure, will see). And you don't
need the corresponding (non-existent) mmap flag. You also have to
specify a size eg MFD_HUGETLB | MFD_HUGE_2MB or you get ENOTSUPP, but
other than that quirk I see it definitely working with eg procstat -v.
That might be because FreeBSD doesn't have a default huge page size
concept? On Linux that's a boot time setting, I guess rarely changed.
I contemplated that once before, when I wrote a quick demo patch[1] to
implement huge_pages=on for FreeBSD (ie explicit rather than
transparent). I used a different function, not the Linuxoid one but
it's the same under the covers, and I wrote:

+ /*
+ * Find the matching page size index, or if huge_page_size wasn't set,
+ * then skip the smallest size and take the next one after that.
+ */

Swapping that topic back in, I was left wondering: (1) how to choose
between SHM_LARGEPAGE_ALLOC_DEFAULT, a policy that will cause
ftruncate() to try to defragment physical memory to fulfil your
request and can eat some serious CPU, and SHM_LARGEPAGE_ALLOC_NOWAIT,
and (2) if it's the second thing, well Linux is like that in respect
of failing fast, but for it to succeed you have to configure
nr_hugepages in the OS as a separate administrative step and *that's*
when it does any defragmentation required, and that's another concept
FreeBSD doesn't have. It's a bit of a weird concept too, I mean those
pages are not reserved for you in any way and anyone could nab them,
which is undeniably practical but it lacks a few qualities one might
hope for in a kernel facility... IDK. Anyway, the Linux-like
memfd_create() always does it the _DEFAULT way. EIther way, we can't
have identical "try" semantics: it'll actually put some effort into
trying, perhaps burning many seconds of CPU.

I took a peek at what we're doing for Windows and the man pages tell
me that it's like that too. I don't recall hearing any complaints
about that, but it's gated on a Windows permission that I assume very
few enabled, so "try" probably isn't trying for most systems.
Quoting:

"Large-page memory regions may be difficult to obtain after the system
has been running for a long time because the physical space for each
large page must be contiguous, but the memory may have become
fragmented. Allocating large pages under these conditions can
significantly affect system performance. Therefore, applications
should avoid making repeated large-page allocations and instead
allocate all large pages one time, at startup."

For Windows we also interpret "on" with GetLargePageMinimum(), which
sounds like my "second known page size" idea.

To make Windows do the thing that this thread wants, I found a thread
saying that calling VirtualAlloc(..., MEM_RESET) and then convincing
every process to call VirtualUnlock(...) might work:

https://groups.google.com/g/microsoft.public.win32.programmer.kernel/c/3SvznY38SSc/m/4Sx_xwon1vsJ

I'm not sure what to do about the other Unixen. One option is
nothing, no feature, patches welcome. Another is to use
shm_open(<made up name>), like DSM segments, except we never need to
reopen these ones so we could immediately call shm_unlink() to leave
only a very short window to crash and leak a name. It'd be low risk
name pollution in a name space that POSIX forgot to provide any way to
list. The other idea is non-standard madvise tricks but they seem
far too squishy to be part of a "portable" fallback if they even work
at all, so it might be better not to have the feature than that I
think.

#56Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#53)
Re: Changing shared_buffers without restart

On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote:

In an offlist chat Thomas Munro mentioned that just ftruncate() would
be enough to resize the shared memory without touching address maps
using mmap and munmap().

ftruncate man page seems to concur with him

If the effect of ftruncate() is to decrease the size of a memory
mapped file or a shared memory object and whole pages beyond the
new end were previously mapped, then the whole pages beyond the
new end shall be discarded.

References to discarded pages shall result in the generation of a
SIGBUS signal.

If the effect of ftruncate() is to increase the size of a memory
object, it is unspecified whether the contents of any mapped pages
between the old end-of-file and the new are flushed to the
underlying object.

ftruncate() when shrinking memory will release the extra pages and
also would cause segmentation fault when memory outside the size of
file is accessed even if the actual address map is larger than the
mapped file. The expanded memory is allocated as it is written to, and
those pages also become visible in the underlying object.

Thanks for sharing. I need to do more thorough tests, but after a quick
look I'm not sure about that. ftruncate will take care of the memory,
but AFAICT the memory mapping will stay the same, is that what you mean?
In that case, if the segment got increased, the memory still can't be
used because it's beyond the mapping end (at least in my test that's
what happened). If the segment got shrunk, the memory couldn't be
reclaimed, because, well, there is already a mapping. Or am I missing
something?

I might have not noticed it, but are we putting two mappings one
reserved and one allocated in the same address space, so that when the
allocated mapping shrinks or expands, the reserved mapping continues
to prohibit any other mapping from appearing there? I looked at some
of the previous emails, but didn't find anything that describes how
the reserved mapped space is managed.

I though so, but this turns out to be incorrect. Just have done a small
experiment -- looks like when reserving some space, mapping and
unmapping a small segment from it leaves a non-mapped gap. That would
mean for shrinking the new available space has to be reserved again.

Right. That's what I thought. But I didn't see the corresponding code.
So we have to keep track of two mappings for every segment - 1 for
allocation and one for reserving space and resize those two while
shrinking and expanding buffers. Am I correct?

Not necessarily, depending on what we want. Again, I'll do a bit more testing,
but after a quick check it seems that it's possible to "plug" the gap with a
new reservation mapping, then reallocate it to another mapping or unmap both
reservations (the main one and the "gap" one) at once. That would mean that
for the current functionality we don't need to track the reservation in any
way beyond the start and the end of the "main" reserved space. The only
consequence I can imagine is possible fragmentation of the reserved space in
case of frequent increases/decreases of a segment with ever decreasing size.
But since it's only reserved space, which will not really be used, it's
probably not going to be a problem.
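
To illustrate what I mean by plugging the gap, here is a rough sketch
(for illustration only, not code from the patch set; the function name
is made up and error handling is trimmed):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>

/*
 * Sketch: shrink a segment that was carved out of a larger reserved range,
 * and immediately re-reserve the freed tail so the overall reserved space
 * stays in place and nothing else can be mapped there.
 */
static int
shrink_segment(char *addr, size_t old_size, size_t new_size, int fd)
{
	/* release the mapping of the tail... */
	if (munmap(addr + new_size, old_size - new_size) == -1)
		return -1;

	/* ...shrink the backing anonymous file accordingly... */
	if (ftruncate(fd, new_size) == -1)
		return -1;

	/* ...and plug the hole with a fresh PROT_NONE reservation */
	if (mmap(addr + new_size, old_size - new_size, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
			 -1, 0) == MAP_FAILED)
		return -1;

	return 0;
}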

#57Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Konstantin Knizhnik (#54)
Re: Changing shared_buffers without restart

On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote:

1. Performance of Postgres CLOCK page eviction algorithm depends on number
of shared buffers. My first native attempt just to mark unused buffers as
invalid cause significant degrade of performance

Thanks for sharing!

Right, but it concerns the case when the number of shared buffers is
high, independently of whether it was changed online or with a
restart, correct? In that case it's out of scope for this patch.

2. There are several data structures i Postgres which size depends on number
of buffers.
In my patch I used in some cases dynamic shared buffer size, but if this
structure has to be allocated in shared memory then still maximal size has
to be used. We have the buffers themselves (8 kB per buffer), then the main
BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's
CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry).
128 bytes per 8kb bytes seems to� large overhead (~1%) but but it may be
quote noticeable with size differences larger than 2 orders of magnitude:
E.g. to support scaling to from 0.5Gb to 128GB , with 128 bytes/buffer we'd
have ~2GiB of static overhead on only 0.5GiB of actual buffers.

Not sure what you mean by using a maximal size; can you elaborate?

In the current patch those structures are allocated as before, except
each goes into a separate segment -- without any extra memory overhead
as far as I see.

3. `madvise` is not portable.

The current implementation doesn't rely on madvise so far (it might for
shared memory shrinking), but yeah there are plenty of other not very
portable things (MAP_FIXED, memfd_create). All of that is mentioned in
the corresponding patches as a limitation.

#58Ni Ku
jakkuniku@gmail.com
In reply to: Ashutosh Bapat (#53)
Re: Changing shared_buffers without restart

Hi Ashutosh / Dmitry,

Thanks for the information and discussions, it's been very helpful.

I also have a related question about how ftruncate() is used in the patch.
In my testing I also see that when using ftruncate to shrink a shared
segment, the memory is freed immediately after the call, even if other
processes still have that memory mapped, and they will hit SIGBUS if they
try to access that memory again as the manpage says.

So am I correct to think that, to support the bufferpool shrinking case, it
would not be safe to call ftruncate in AnonymousShmemResize as-is, since at
that point other processes may still be using pages that belong to the
truncated memory?
It appears that for shrinking we should only call ftruncate when we're sure
no process will access those pages again (eg, all processes have handled
the resize interrupt signal barrier). I suppose this can be done by the
resize coordinator after synchronizing with all the other processes.
But in that case it seems we cannot use the postmaster as the coordinator,
because I see some code comments saying the postmaster does not have
waiting infrastructure... (maybe even if the postmaster had waiting infra
we wouldn't want to use it anyway, since it could be blocked for a long
time and wouldn't be able to serve other requests).

Regards,

Jack Ng

#59Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#55)
Re: Changing shared_buffers without restart

On Fri, Apr 18, 2025 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I contemplated that once before, when I wrote a quick demo patch[1] to
implement huge_pages=on for FreeBSD (ie explicit rather than
transparent). I used a different function, not the Linuxoid one but

Oops, I forgot to supply that link[1]. And by the way all that
technical mumbo jumbo about FreeBSD was just me writing up why I
didn't pull the trigger and add explicit huge_pages support for it.
The short version is: you shouldn't try to use that flag at all on
FreeBSD yet, as it's a separate research project to add that feature.
I care about PostgreSQL/FreeBSD personally and may consider that again
as I learn more about virtual memory topics, but actually its
transparent super pages seem to do a pretty decent job already and
people don't seem to want to turn them off.

For an actionable plan that should be portable everywhere, how about
this: use shm_open(<tempname>, O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)
followed by shm_unlink(<tempname>) to make this work on every Unix
(FreeBSD could use its slightly better SHM_ANON as the name and skip
the unlink), and redirect to memfd inside #ifdef __linux__. One thing
to consider is that shm_open() descriptors are implicitly set to
FD_CLOEXEC per POSIX, so I think you need to clear that flag with
fcntl() in EXEC_BACKEND builds, and then also set it again in children
so that they don't pass the descriptor to subprograms they run with
system() etc. memfd_create() needs the same consideration, except its
default is the other way: I think you need to supply the MFD_CLOEXEC
flag explicitly, unless it's an EXEC_BACKEND build, and use the same
fnctl() to clear it in children if it is. To restate that the other
way around, in non-EXEC_BACKEND builds shm_open() already does the
right thing and memfd_create() needs MFD_CLOEXEC, with no extra steps
after that.
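
In other words, something along these lines (a made-up helper just to
illustrate the flag handling; real code would report fcntl() failures
properly):

#include <fcntl.h>
#include <stdbool.h>

/* Sketch only: toggle close-on-exec on a shm_open()/memfd_create() fd. */
static void
adjust_segment_cloexec(int fd, bool exec_backend, bool is_child)
{
    int     flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return;                 /* sketch: real code would complain */

    if (exec_backend && !is_child)
        flags &= ~FD_CLOEXEC;   /* postmaster must keep the fd across exec() */
    else
        flags |= FD_CLOEXEC;    /* don't leak the fd into system() etc. */

    (void) fcntl(fd, F_SETFD, flags);
}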

The only systems I'm aware of that *don't* have shm_open() are (1)
Android, but it's Linux so I assume it has memfd_create() (just for
fun: you can run PostgreSQL on a phone with termux[2], and you can see
that their package supplies a fake shm_open() that redirects to plain
open(); I guess they didn't realise they could have supplied an ENOSYS
dummy and just set dynamic_shared_memory_type=mmap instead, and we'd
have done that for them!), and (2) the capability-based research OS
projects like Capsicum (and probably the others like it) that rip out
all the global namespace Unix APIs for approximately the same reason
as Android (PostgreSQL can't run under those yet, but just for fun: I
had PostgreSQL mostly working under Capsicum once, and noticed that
the problems to be solved had significant overlap with the
multithreading project: the global namespace stuff like signals/PIDs
and onymous IPC go away, and the only other major thing is absolute
paths, many of which are easily made relative to a pgdata fd and
handled with openat() in fd.c, but I digress...).

[1]: /messages/by-id/CA+hUKGLmBWHF6gusP55R7jVS1=6T=GphbZpUXiOgMMHDUkVCgw@mail.gmail.com
[2]: https://github.com/termux/termux-packages/tree/master/packages/postgresql

#60Konstantin Knizhnik
knizhnik@garret.ru
In reply to: Dmitry Dolgov (#57)
Re: Changing shared_buffers without restart

On 18/04/2025 12:26 am, Dmitry Dolgov wrote:

On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote:

1. The performance of the Postgres CLOCK page eviction algorithm depends on the
number of shared buffers. My first naive attempt to just mark unused buffers as
invalid caused a significant performance degradation

Thanks for sharing!

Right, but that concerns the case when the number of shared buffers is
high, independently of whether it was changed online or with a
restart, correct? In that case it's out of scope for this patch.

2. There are several data structures in Postgres whose size depends on the
number of buffers.
In my patch I used a dynamic shared buffer size in some cases, but if such a
structure has to be allocated in shared memory then the maximal size still
has to be used. We have the buffers themselves (8 kB per buffer), then the main
BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's
CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry).
128 bytes per 8 kB buffer may not seem like a large overhead (~1%), but it
can become quite noticeable with size differences larger than 2 orders of
magnitude: e.g. to support scaling from 0.5 GB to 128 GB, with 128
bytes/buffer we'd have ~2 GiB of static overhead on only 0.5 GiB of actual
buffers.

Not sure what you mean by using a maximal size, can you elaborate?

In the current patch those structures are allocated as before, except
each goes into a separate segment -- without any extra memory overhead
as far as I see.

Thank you for the explanation. I am sorry that I did not investigate
your patch precisely before writing: it seemed to me that you were
placing only the content of shared buffers in a separate segment.
Now I see that I was wrong, and that is actually the main difference from
the memory ballooning approach I have used. Since you are allocating the
buffer descriptors and the hash table in the same segment,
there is no extra memory overhead.
The only drawback is that we lose the content of shared buffers in
case of a resize. That may be sad, but it doesn't look like there is a
better alternative.

But there are still some dependencies on the shared buffers size which are
not addressed in this patch set.
I am not sure how critical they are, or whether it is possible to do
something about them, but I at least want to enumerate them:

1. Checkpointer: the maximal number of checkpointer requests depends on
NBuffers. So if we start with small shared buffers and then upscale, it
may cause too frequent checkpoints:

Size
CheckpointerShmemSize(void)
...
        size = add_size(size, mul_size(NBuffers,
sizeof(CheckpointerRequest)));

CheckpointerShmemInit(void)
        CheckpointerShmem->max_requests = NBuffers;

2. XLOG: number of xlog buffers is calculated depending on number of
shared buffers:

XLOGChooseNumBuffers(void)
{
...
     xbuffers = NBuffers / 32;

Should not cause errors, but may not be very efficient if, once again, we
start with tiny shared buffers.

3. AIO: AIO max concurrency is also calculated based on the number of shared
buffers:

AioChooseMaxConcurrency(void)
{
...

    max_proportional_pins = NBuffers / max_backends;

For small shared buffers (e.g. 1 MB) there will be no concurrency at all.

So none of these issues causes an error, just some inefficient behavior.
But if we want to start with very small shared buffers and then increase
them on demand, it can be a problem.

In all three cases NBuffers is used not just to calculate some threshold
value, but also to determine the size of a structure in shared memory.
The straightforward solution is to place them in the same segment as
shared buffers, but I am not sure how difficult that would be to implement.

#61Thomas Munro
thomas.munro@gmail.com
In reply to: Dmitry Dolgov (#56)
Re: Changing shared_buffers without restart

On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote:

In an offlist chat Thomas Munro mentioned that just ftruncate() would
be enough to resize the shared memory without touching address maps
using mmap and munmap().

ftruncate man page seems to concur with him

If the effect of ftruncate() is to decrease the size of a memory
mapped file or a shared memory object and whole pages beyond the
new end were previously mapped, then the whole pages beyond the
new end shall be discarded.

References to discarded pages shall result in the generation of a
SIGBUS signal.

If the effect of ftruncate() is to increase the size of a memory
object, it is unspecified whether the contents of any mapped pages
between the old end-of-file and the new are flushed to the
underlying object.

ftruncate(), when shrinking memory, will release the extra pages and will
also cause a SIGBUS when memory beyond the new size of the file is
accessed, even if the actual address mapping is larger than the mapped
file. The expanded memory is allocated as it is written to, and those
pages also become visible in the underlying object.

Thanks for sharing. I need to do more thorough tests, but after a quick
look I'm not sure about that. ftruncate will take care of the memory,
but AFAICT the memory mapping will stay the same, is that what you mean?
In that case, if the segment got increased, the memory still can't be
used because it's beyond the mapping end (at least in my test that's
what happened). If the segment got shrunk, the memory couldn't be
reclaimed, because, well, there is already a mapping. Or do I miss
something?

I was imagining that you might map some maximum possible size at the
beginning to reserve the address space permanently, and then adjust
the virtual memory object's size with ftruncate as required to provide
backing. Doesn't that achieve the goal with fewer steps, using only
portable* POSIX stuff, and keeping all pointers stable? I understand
that pointer stability may not be required (I can see roughly how that
argument is constructed), but isn't it still better to avoid having to
prove that and deal with various other problems completely? Is there
a downside/cost to having a large mapping that is only partially
backed? I suppose choosing that number might offend you but at least
there is an obvious upper bound: physical memory size.

*You might also want to use fallocate after ftruncate on Linux to
avoid SIGBUS on allocation failure on first-touch page fault, which
raises portability questions since it's unspecified whether you can do
that with shm fds and it fails on some systems, but let's call that an
independent topic as it's not affected by this choice.

#62Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#61)
Re: Changing shared_buffers without restart

Hi,

On April 18, 2025 11:17:21 AM GMT+02:00, Thomas Munro <thomas.munro@gmail.com> wrote:

Doesn't that achieve the goal with fewer steps, using only
portable* POSIX stuff, and keeping all pointers stable? I understand
that pointer stability may not be required (I can see roughly how that
argument is constructed), but isn't it still better to avoid having to
prove that and deal with various other problems completely?

I think we should flat out reject any approach that does not maintain pointer stability. It would restrict future optimizations a lot if we can't rely on that (e.g. not materializing tuples when transporting them from worker to leader; pointering datastructures in shared buffers).

Greetings,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#63Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#61)
Re: Changing shared_buffers without restart

On Fri, Apr 18, 2025 at 9:17 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Thanks for sharing. I need to do more thorough tests, but after a quick
look I'm not sure about that. ftruncate will take care about the memory,
but AFAICT the memory mapping will stay the same, is that what you mean?
In that case if the segment got increased, the memory still can't be
used because it's beyond the mapping end (at least in my test that's
what happened). If the segment got shrinked, the memory couldn't be
reclaimed, because, well, there is already a mapping. Or do I miss
something?

I was imagining that you might map some maximum possible size at the
beginning to reserve the address space permanently, and then adjust
the virtual memory object's size with ftruncate as required to provide
backing. Doesn't that achieve the goal with fewer steps, using only
portable* POSIX stuff, and keeping all pointers stable? I understand
that pointer stability may not be required (I can see roughly how that
argument is constructed), but isn't it still better to avoid having to
prove that and deal with various other problems completely? Is there
a downside/cost to having a large mapping that is only partially
backed? I suppose choosing that number might offend you but at least
there is an obvious upper bound: physical memory size.

TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side
effect on Linux, as if you had called ftruncate on it (I expected fully
allocated huge pages up to the object's size, just not magical size
changes beyond that when I merely asked to map it). That doesn't
happen for regular page size, or for any page size on my local OS's
shm objects, and it doesn't seem to fit mmap's job description given an
fd*, but maybe I'm just confused. Anyway, a workaround seems to be
to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ
| PROT_WRITE) new regions after extending with ftruncate(), at least
in simple tests...
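
Spelled out, that sequence is roughly the following (a bare sketch with no
error handling; the sizes are arbitrary placeholders and MFD_HUGETLB only
matters for the hugetlb case discussed above):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

size_t  reserved = (size_t) 16 * 1024 * 1024 * 1024;   /* upper bound */
size_t  active = (size_t) 1 * 1024 * 1024 * 1024;      /* current size */
int     fd = memfd_create("shmem_sketch", MFD_CLOEXEC | MFD_HUGETLB);

/* Reserve address space only: no backing, no access yet. */
char   *base = mmap(NULL, reserved, PROT_NONE,
                    MAP_SHARED | MAP_NORESERVE, fd, 0);

/* Provide backing for the initial region, then open it up. */
ftruncate(fd, active);
mprotect(base, active, PROT_READ | PROT_WRITE);

/* Growing later: extend the object, then unlock the new range. */
ftruncate(fd, 2 * active);
mprotect(base + active, active, PROT_READ | PROT_WRITE);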

(*Hmm, wild uninformed speculation: perhaps the size-setting behaviour
needed when hugetlbfs is used secretly to implement MAP_ANONYMOUS is
being exposed also when a hugetlbfs fd is given explicitly to mmap,
generating this bizarro side effect?)

#64Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thomas Munro (#63)
Re: Changing shared_buffers without restart

On Fri, Apr 18, 2025 at 09:17:21PM GMT, Thomas Munro wrote:
I was imagining that you might map some maximum possible size at the
beginning to reserve the address space permanently, and then adjust
the virtual memory object's size with ftruncate as required to provide
backing. Doesn't that achieve the goal with fewer steps, using only
portable* POSIX stuff, and keeping all pointers stable?

Ah, I see what you folks mean. So in the latest patch there is a single large
shared memory area reserved with PROT_NONE + MAP_NORESERVE. This area is
logically divided between shmem segments, and each segment is mmap'd out of it
and can be resized within these logical boundaries. Now the suggestion is to
have one reserved area for each segment and, instead of really mmap'ing
something out of it, manage memory via ftruncate.

Yeah, that would work and would allow avoiding MAP_FIXED and mremap, which are
questionable from a portability point of view. This leaves memfd_create, and I'm
still not completely clear on its portability -- it seems to be specific to
Linux, but others provide compatible implementations as well.

Let me experiment with this idea a bit, I would like to make sure there are no
other limitations we might face.

I understand that pointer stability may not be required

Just to clarify, the current patch maintains this property (stable pointers),
which I also see as mandatory for any possible implementation.

*You might also want to use fallocate after ftruncate on Linux to
avoid SIGBUS on allocation failure on first touch page fault, which
raises portability questions since it's unspecified whether you can do
that with shm fds and fails on some systems, but it let's call that an
independent topic as it's not affected by this choice.

I'm afraid it would be strictly necessary to do fallocate, otherwise we're
back to where we were before reservation accounting for huge pages in Linux
(lots of people were facing unexpected SIGBUS when dealing with cgroups).

TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side
effect on Linux, as if you had called ftruncate on it (fully allocated
huge pages I expected up to the object's size, just not magical size
changes beyond that when I merely asked to map it). That doesn't
happen for regular page size, or for any page size on my local OS's
shm objects and doesn't seem to fit mmap's job description given an
fd*, but maybe I'm just confused. Anyway, a workaround seems to be
to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ
| PROT_WRITE) new regions after extending with ftruncate(), at least
in simple tests...

Right, it's similar to the currently implemented space reservation, which also
goes with PROT_NONE and MAP_NORESERVE. I assume it boils down to the way
memory reservation accounting in Linux works.

#65Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ni Ku (#58)
Re: Changing shared_buffers without restart

On Thu, Apr 17, 2025 at 07:05:36PM GMT, Ni Ku wrote:
I also have a related question about how ftruncate() is used in the patch.
In my testing I also see that when using ftruncate to shrink a shared
segment, the memory is freed immediately after the call, even if other
processes still have that memory mapped, and they will hit SIGBUS if they
try to access that memory again as the manpage says.

So am I correct to think that, to support the bufferpool shrinking case, it
would not be safe to call ftruncate in AnonymousShmemResize as-is, since at
that point other processes may still be using pages that belong to the
truncated memory?
It appears that for shrinking we should only call ftruncate when we're sure
no process will access those pages again (eg, all processes have handled
the resize interrupt signal barrier). I suppose this can be done by the
resize coordinator after synchronizing with all the other processes.
But in that case it seems we cannot use the postmaster as the coordinator
then? b/c I see some code comments saying the postmaster does not have
waiting infrastructure... (maybe even if the postmaster has waiting infra
we don't want to use it anyway since it can be blocked for a long time and
won't be able to serve other requests).

There is already a coordination infrastructure, implemented in the patch
0006, which will take care of this and prevent access to the shared
memory until everything is resized.

#66Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Konstantin Knizhnik (#60)
Re: Changing shared_buffers without restart

On Fri, Apr 18, 2025 at 10:06:23AM GMT, Konstantin Knizhnik wrote:
The only drawback is that we lose the content of shared buffers in case
of a resize. That may be sad, but it doesn't look like there is a better
alternative.

No, why would we lose the content? If we do mremap, it will leave the
content as it is. If we do munmap/mmap with an anonymous backing file,
it will also keep the content in memory. The same goes for the other
proposal about using only ftruncate/fallocate; both will leave the content
untouched unless told to do otherwise.

But there are still some dependencies on the shared buffers size which are not
addressed in this patch set.
I am not sure how critical they are, or whether it is possible to do something
about them, but I at least want to enumerate them:

Right, I'm aware of those (except the AIO one, which was added after
the first version of the patch), and didn't address them yet for the
same reason you've mentioned -- they're not hard errors, rather
inefficiencies. But thanks for the reminder, I keep those in the back of
my mind, and once the rest of the design is settled, I'll try to address
them as well.

#67Thomas Munro
thomas.munro@gmail.com
In reply to: Dmitry Dolgov (#64)
Re: Changing shared_buffers without restart

On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Yeah, that would work and will allow to avoid MAP_FIXED and mremap, which are
questionable from portability point of view. This leaves memfd_create, and I'm
still not completely clear on it's portability -- it seems to be specific to
Linux, but others provide compatible implementation as well.

Something like this should work, roughly based on DSM code except here
we don't really need the name so we unlink it immediately, at the
slight risk of leaking it if the postmaster is killed between those
lines (maybe someone should go and tell POSIX to support the special
name SHM_ANON or some other way to avoid that; I can't see any
portable workaround). Not tested/compiled, just a sketch:

#ifdef HAVE_MEMFD_CREATE
/* Anonymous shared memory region. */
fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
#else
/* Standard POSIX insists on a name, which we unlink immediately. */
do
{
char tmp[80];
snprintf(tmp, sizeof(tmp), "PostgreSQL.%u",
pg_prng_uint32(&pg_global_prng_state));
fd = shm_open(tmp, O_CREAT | O_EXCL | O_RDWR, S_IRUSR | S_IWUSR);
if (fd >= 0)
shm_unlink(tmp);
} while (fd < 0 && errno == EEXIST);
#endif

Let me experiment with this idea a bit, I would like to make sure there are no
other limitations we might face.

One thing I'm still wondering about is whether you really need all
this multi-phase barrier stuff, or even need to stop other backends
from running at all while doing the resize. I guess that's related to
your remapping scheme, but supposing you find the simple
ftruncate()-only approach to be good, my next question is: why isn't
it enough to wait for all backends to agree to stop allocating new
buffers in the range to be truncated, and then let them continue to
run as normal? As far as they would be concerned, the in-progress
downsize has already happened, though it could be reverted later if
the eviction phase fails. Then the coordinator could start evicting
buffers and truncating the shared memory object, which are
phases/steps, sure, but it's not clear to me why they need other
backends' help.

It sounds like Windows might need a second ProcSignalBarrier poke in
order to call VirtualUnlock() in every backend. That's based on that
Usenet discussion I lobbed in here the other day; I haven't tried it
myself or fully grokked why it works, and there could well be other
ways, IDK. Assuming it's the right approach, between the first poke
to make all backends accept the new lower size and the second poke to
unlock the memory, I don't see why they need to wait. I suppose it
would be the same ProcSignalBarrier, but behaving differently based on a
control variable. I suppose there could also be a third poke, if you
want to consider the operation to be fully complete only once they
have all actually done that unlock step, but it may also be OK not to
worry about that, IDK.

On the other hand, maybe it just feels less risky if you stop the
whole world, or maybe you envisage parallelising the eviction work, or
there is some correctness concern I haven't grokked yet, but what?

*You might also want to use fallocate after ftruncate on Linux to
avoid SIGBUS on allocation failure on first-touch page fault, which
raises portability questions since it's unspecified whether you can do
that with shm fds and it fails on some systems, but let's call that an
independent topic as it's not affected by this choice.

I'm afraid it would be strictly necessary to do fallocate, otherwise we're
back to where we were before reservation accounting for huge pages in Linux
(lots of people were facing unexpected SIGBUS when dealing with cgroups).

Yeah. FWIW here is where we decided to gate that on __linux__ while
fixing that for DSM:

/messages/by-id/CAEepm=0euOKPaYWz0-gFv9xfG+8ptAjhFjiQEX0CCJaYN--sDQ@mail.gmail.com

#68Jack Ng
Jack.Ng@huawei.com
In reply to: Dmitry Dolgov (#65)
RE: Changing shared_buffers without restart

Thanks Dmitry. Right, the coordination mechanism in v4-0006 works as expected in various tests (sorry, I misunderstood some details initially).

I also want to report a couple of minor issues found during testing (which you may be aware of already):

1. For memory segments other than the first one ('main'), the start address passed to mmap may not be aligned to 4KB or the huge page size (since reserved_offset may not be aligned), which can cause mmap to fail.

2. Since the ratios for main/desc/iocv/checkpt/strategy in SHMEM_RESIZE_RATIO are relatively small, I think we need to guard against the case where 'max_available_memory' is too small for the required sizes of these segments (from CalculateShmemSize).
For example, when max_available_memory=default and shared_buffers=128kB, 'main' still needs ~109MB, but since only 10% of max_available_memory is reserved for it (~102MB) and the start address of the next segment is calculated based on reserved_offset, this causes the mappings to overlap, leading to memory problems later (I hit this after fixing 1).
I suppose we can change the minimum value of max_available_memory to be large enough, and may also adjust the ratios in SHMEM_RESIZE_RATIO to ensure the reserved space for those segments is sufficient.

Regards,

Jack Ng

-----Original Message-----
From: Dmitry Dolgov <9erthalion6@gmail.com>
Sent: Monday, April 21, 2025 5:33 AM
To: Ni Ku <jakkuniku@gmail.com>
Cc: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>; pgsql-hackers@postgresql.org; Robert Haas <robertmhaas@gmail.com>
Subject: Re: Changing shared_buffers without restart

On Thu, Apr 17, 2025 at 07:05:36PM GMT, Ni Ku wrote:
I also have a related question about how ftruncate() is used in the patch.
In my testing I also see that when using ftruncate to shrink a shared
segment, the memory is freed immediately after the call, even if other
processes still have that memory mapped, and they will hit SIGBUS if
they try to access that memory again as the manpage says.

So am I correct to think that, to support the bufferpool shrinking
case, it would not be safe to call ftruncate in AnonymousShmemResize
as-is, since at that point other processes may still be using pages
that belong to the truncated memory?
It appears that for shrinking we should only call ftruncate when we're
sure no process will access those pages again (eg, all processes have
handled the resize interrupt signal barrier). I suppose this can be
done by the resize coordinator after synchronizing with all the other processes.
But in that case it seems we cannot use the postmaster as the
coordinator then? b/c I see some code comments saying the postmaster
does not have waiting infrastructure... (maybe even if the postmaster
has waiting infra we don't want to use it anyway since it can be
blocked for a long time and won't be able to serve other requests).

There is already a coordination infrastructure, implemented in the patch 0006, which will take care of this and prevent access to the shared memory until everything is resized.

#69Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Jack Ng (#68)
Re: Changing shared_buffers without restart

On Tue, May 06, 2025 at 04:23:07AM GMT, Jack Ng wrote:
Thanks Dmitry. Right, the coordination mechanism in v4-0006 works as expected in various tests (sorry, I misunderstood some details initially).

Great, thanks for checking.

I also want to report a couple of minor issues found during testing (which you may be aware of already):

1. For memory segments other than the first one ('main'), the start address passed to mmap may not be aligned to 4KB or the huge page size (since reserved_offset may not be aligned), which can cause mmap to fail.

2. Since the ratios for main/desc/iocv/checkpt/strategy in SHMEM_RESIZE_RATIO are relatively small, I think we need to guard against the case where 'max_available_memory' is too small for the required sizes of these segments (from CalculateShmemSize).
For example, when max_available_memory=default and shared_buffers=128kB, 'main' still needs ~109MB, but since only 10% of max_available_memory is reserved for it (~102MB) and the start address of the next segment is calculated based on reserved_offset, this causes the mappings to overlap, leading to memory problems later (I hit this after fixing 1).
I suppose we can change the minimum value of max_available_memory to be large enough, and may also adjust the ratios in SHMEM_RESIZE_RATIO to ensure the reserved space for those segments is sufficient.

Yeah, good points. I've introduced max_available_memory expecting some
heated discussions about it, and thus didn't put a lot of effort into
covering all the possible scenarios. But now I'm reworking it along the
lines suggested by Thomas, and will address those as well. Thanks!

#70Jack Ng
Jack.Ng@huawei.com
In reply to: Dmitry Dolgov (#69)
RE: Changing shared_buffers without restart

all the possible scenarios. But now I'm reworking it along the lines suggested
by Thomas, and will address those as well. Thanks!

Thanks for the info, Dmitry.
Just want to confirm my understanding of Thomas' suggestion and your discussions... I think the simpler and more portable solution goes something like the following?

* For each BP resource segment (main, desc, buffers, etc):
1. create an anonymous file as backing
2. mmap a large reserved shared memory area with PROT_READ/PROT_WRITE + MAP_NORESERVE using the anon fd
3. use ftruncate to back the in-use region (and maybe posix_fallocate too to avoid SIGBUS on alloc failure during first-touch), but no need to create a memory mapping for it
4. also no need to create a separate mapping for the reserved region (already covered by the mapping created in 2.)

|-- Memory mapping (MAP_NORESERVE) for BUFFER --|
|-- In-use region --|----- Reserved region -----|

* During resize, simply calculate the new size and call ftruncate on each segment to adjust memory accordingly, no need to mmap/munmap or modify any memory mapping.

I tried this approach with a test program (with huge pages), and both expand and shrink seem to work as expected -- for shrink, the memory is freed right after the resize ftruncate.
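
In outline, the per-segment sequence boils down to something like this (a
sketch only, not my actual test program; error handling is omitted, the
sizes are placeholders, and whether posix_fallocate() works on such fds
everywhere is the portability question mentioned upthread):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

size_t  reserved_size = (size_t) 8 << 30;   /* placeholder upper bound */
size_t  in_use = (size_t) 1 << 30;          /* placeholder initial size */

/* 1. anonymous file as backing */
int     fd = memfd_create("segment_sketch", MFD_CLOEXEC);

/* 2. one mapping covering both the in-use and the reserved region */
char   *base = mmap(NULL, reserved_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_NORESERVE, fd, 0);

/* 3. back the in-use region; fallocate guards against SIGBUS on first touch */
ftruncate(fd, in_use);
posix_fallocate(fd, 0, in_use);

/* Resize (grow or shrink): only the backing object changes. */
size_t  new_size = (size_t) 2 << 30;        /* placeholder new size */
ftruncate(fd, new_size);
if (new_size > in_use)
    posix_fallocate(fd, in_use, new_size - in_use);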

Regards,

Jack Ng

#71Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Jack Ng (#70)
1 attachment(s)
Re: Changing shared_buffers without restart

On Wed, May 7, 2025 at 11:04 AM Jack Ng <Jack.Ng@huawei.com> wrote:

all the possible scenarios. But now I'm reworking it along the lines

suggested

by Thomas, and will address those as well. Thanks!

Thanks for the info, Dmitry.
Just want to confirm my understanding of Thomas' suggestion and your
discussions... I think the simpler and more portable solution goes
something like the following?

* For each BP resource segment (main, desc, buffers, etc):
1. create an anonymous file as backing
2. mmap a large reserved shared memory area with PROT_READ/PROT_WRITE +
MAP_NORESERVE using the anon fd
3. use ftruncate to back the in-use region (and maybe posix_fallocate
too to avoid SIGBUS on alloc failure during first-touch), but no need to
create a memory mapping for it
4. also no need to create a separate mapping for the reserved region
(already covered by the mapping created in 2.)

|-- Memory mapping (MAP_NORESERVE) for BUFFER --|
|-- In-use region --|----- Reserved region -----|

* During resize, simply calculate the new size and call ftruncate on each
segment to adjust memory accordingly, no need to mmap/munmap or modify any
memory mapping.

That's the same as my understanding.

I tried this approach with a test program (with huge pages), and both
expand and shrink seem to work as expected --for shrink, the memory is
freed right after the resize ftruncate.

I thought I had shared a test program upthread, but I can't find it now.

Attached here. Can you please share your test program?

There are concerns around portability of this approach, though.

--
Best Wishes,
Ashutosh Bapat

Attachments:

mfdtruncate.ctext/x-csrc; charset=US-ASCII; name=mfdtruncate.cDownload
#72Jack Ng
Jack.Ng@huawei.com
In reply to: Ashutosh Bapat (#71)
1 attachment(s)
RE: Changing shared_buffers without restart

Hi Ashutosh,

* During resize, simply calculate the new size and call ftruncate on each
segment to adjust memory accordingly, no need to mmap/munmap or modify any
memory mapping.

That's same as my understanding.

Great, thanks for confirming!

I thought I had shared a test program upthread, but I don't find it now. Attached here. Can you please share your test program?

Sure, mine is attached here (it’s based on another test program you shared before :-)

Regards,

Jack Ng

Attachments:

test_hugepage_mappings.ctext/plain; name=test_hugepage_mappings.cDownload
#73Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Thomas Munro (#67)
12 attachment(s)
Re: Changing shared_buffers without restart

On Mon, Apr 21, 2025 at 7:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6@gmail.com>
wrote:

Yeah, that would work and will allow to avoid MAP_FIXED and mremap,

which are

questionable from portability point of view. This leaves memfd_create,

and I'm

still not completely clear on it's portability -- it seems to be

specific to

Linux, but others provide compatible implementation as well.

Something like this should work, roughly based on DSM code except here
we don't really need the name so we unlink it immediately, at the
slight risk of leaking it if the postmaster is killed between those
lines (maybe someone should go and tell POSIX to support the special
name SHM_ANON or some other way to avoid that; I can't see any
portable workaround). Not tested/compiled, just a sketch:

#ifdef HAVE_MEMFD_CREATE
/* Anonymous shared memory region. */
fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
#else
/* Standard POSIX insists on a name, which we unlink immediately. */
do
{
char tmp[80];
snprintf(tmp, sizeof(tmp), "PostgreSQL.%u",
pg_prng_uint32(&pg_global_prng_state));
fd = shm_open(tmp, O_CREAT | O_EXCL | O_RDWR, S_IRUSR | S_IWUSR);
if (fd >= 0)
shm_unlink(tmp);
} while (fd < 0 && errno == EEXIST);
#endif

Let me experiment with this idea a bit, I would like to make sure there

are no

other limitations we might face.

One thing I'm still wondering about is whether you really need all
this multi-phase barrier stuff, or even need to stop other backends
from running at all while doing the resize. I guess that's related to
your remapping scheme, but supposing you find the simple
ftruncate()-only approach to be good, my next question is: why isn't
it enough to wait for all backends to agree to stop allocating new
buffers in the range to be truncated, and then left them continue to
run as normal? As far as they would be concerned, the in-progress
downsize has already happened, though it could be reverted later if
the eviction phase fails. Then the coordinator could start evicting
buffers and truncating the shared memory object, which are
phases/steps, sure, but it's not clear to me why they need other
backends' help.

AFAIU, we required the phased approach since mremap needed to happen in
every backend after buffer eviction but before making modifications to the
shared memory. If we don't need to call mremap in every backend and just
ftruncate + initializing memory (when expanding buffers) is enough, I think
a phased approach isn't needed. But I haven't tried it myself.

Here's the patchset rebased on 3feff3916ee106c084eca848527dc2d2c3ef4e89.
0001 - 0008 are the same as the previous patchset.

0009 adds support to shrink shared buffers. It has two changes: a. evict
the buffers outside the new buffer size, b. remove buffers with buffer ids
outside the new buffer size from the free list. If a buffer being evicted
is pinned, the operation is aborted and a FATAL error is raised. I think we
need to change this behaviour to be less severe, like rolling back the
operation or waiting for the pinned buffer to be unpinned etc. It would be
even better if we could let users control the behaviour. But we need better
infrastructure to do such things. That's one TODO left in the patch.
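
In pseudo-C, the shrink loop is conceptually something like this (a
simplified sketch, not the actual 0009 code; EvictExtraBuffer() is a
placeholder name for whatever flushes and invalidates the buffer, and
locking around the pin check is elided):

/* assumes storage/buf_internals.h; NBuffersOld/NBuffersNew are placeholders */
for (int buf_id = NBuffersNew; buf_id < NBuffersOld; buf_id++)
{
    BufferDesc *desc = GetBufferDescriptor(buf_id);
    uint32      state = pg_atomic_read_u32(&desc->state);

    /* TODO: wait for the pin or roll back instead of giving up */
    if (BUF_STATE_GET_REFCOUNT(state) > 0)
        elog(FATAL, "cannot shrink shared_buffers: buffer %d is pinned",
             buf_id + 1);

    /* flush if dirty, invalidate, and drop the entry from the free list */
    EvictExtraBuffer(buf_id);
}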

0010 is about Strategy reinitialization. Once we expand
the buffers, the new buffers need to be added to the free list. Some
StrategyControl area members (not all) need to be adjusted. That's what
this patch does. But a deeper adjustment in BgBufferSync() and
ClockSweepTick() is required. Further, we need to do something about the
buffer lookup table. More on that later in this email.

0011-0012 fix compilation issues in these patches, but those fixes are not
correct. The patches are there so that binaries can be built without any
compilation issues and someone can experiment with buffer resizing. The good
thing is that the compilation fixes are in the SQL-callable functions
pg_get_shmem_pagesize() and pg_get_shmem_numa(), so there's no ill effect
from these patches as long as those two functions are not called.

Buffer lookup table resizing
------------------------------------
The size of the buffer lookup table depends upon (number of shared
buffers + number of partitions in the shared buffer lookup table). If we
shrink the buffer pool, the buffer lookup table will become sparse but
still useful. If we expand the buffers we need to expand the buffer lookup
table too. That's not implemented in the current patchset. There are two
solutions here:

1. We map a lot of extra address space (not memory) initially to
accommodate future expansion of the shared buffer pool. Let's say that the
total address space is sufficient to accommodate Nx buffers. A simple solution
is to allocate a buffer lookup table with Nx initial entries so that we
don't have to resize the buffer lookup table ever. It will waste memory, but
we might be ok with that as a version 1 solution. According to my offline
discussion with David Rowley, buffer lookups in sparse hash tables are
inefficient because of more cacheline faults. Whether that translates to
any noticeable performance degradation in TPS needs to be measured.

2. An alternate solution is to resize the buffer mapping table as well. This
means that we rehash all the entries again, which may take a longer time, and
the partitions will remain locked for that amount of time. Not to mention that
this will require a non-trivial change to the dynahash implementation.

Next I will look at BgBufferSync() and ClockSweepTick() adjustments and
then buffer lookup table fix with approach 1.

--
Best Wishes,
Ashutosh Bapat

Attachments:

0005-Introduce-pss_barrierReceivedGeneration-20250610.patchtext/x-patch; charset=US-ASCII; name=0005-Introduce-pss_barrierReceivedGeneration-20250610.patchDownload
From fb470c019742f2e9eaa7666ab81a24f816066387 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH 05/17] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier allows to make sure the message sent
via EmitProcSignalBarrier was processed by all ProcSignal mechanism
participants.

Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
which will be updated when a process has received the message, but not
processed it yet. This makes it possible to support a new mode of
waiting, when ProcSignal participants want to synchronize message
processing. To do that, a participant can wait via
WaitForProcSignalBarrierReceived when processing a message, effectively
making sure that all processes are going to start processing
ProcSignalBarrier simultaneously.
---
 src/backend/storage/ipc/procsignal.c | 67 ++++++++++++++++++++++------
 src/include/storage/procsignal.h     |  1 +
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index a9bb540b55a..c6bec9be423 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -58,7 +58,10 @@
  * of it. For such use cases, we set a bit in pss_barrierCheckMask and then
  * increment the current "barrier generation"; when the new barrier generation
  * (or greater) appears in the pss_barrierGeneration flag of every process,
- * we know that the message has been received everywhere.
+ * we know that the message has been received and processed everywhere. In case
+ * we only need to know that the message was received everywhere (e.g.
+ * receiving processes need to handle the message in a coordinated fashion),
+ * use pss_barrierReceivedGeneration in the same way.
  */
 typedef struct
 {
@@ -70,6 +73,7 @@ typedef struct
 
 	/* Barrier-related fields (not protected by pss_mutex) */
 	pg_atomic_uint64 pss_barrierGeneration;
+	pg_atomic_uint64 pss_barrierReceivedGeneration;
 	pg_atomic_uint32 pss_barrierCheckMask;
 	ConditionVariable pss_barrierCV;
 } ProcSignalSlot;
@@ -152,6 +156,8 @@ ProcSignalShmemInit(void)
 			slot->pss_cancel_key_len = 0;
 			MemSet(slot->pss_signalFlags, 0, sizeof(slot->pss_signalFlags));
 			pg_atomic_init_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+			pg_atomic_init_u64(&slot->pss_barrierReceivedGeneration,
+							   PG_UINT64_MAX);
 			pg_atomic_init_u32(&slot->pss_barrierCheckMask, 0);
 			ConditionVariableInit(&slot->pss_barrierCV);
 		}
@@ -199,6 +205,8 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	barrier_generation =
 		pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration,
+						barrier_generation);
 
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
@@ -263,6 +271,7 @@ CleanupProcSignalState(int status, Datum arg)
 	 * no barrier waits block on it.
 	 */
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration, PG_UINT64_MAX);
 
 	SpinLockRelease(&slot->pss_mutex);
 
@@ -416,12 +425,8 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
 	return generation;
 }
 
-/*
- * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
- * requested by a specific call to EmitProcSignalBarrier() have taken effect.
- */
-void
-WaitForProcSignalBarrier(uint64 generation)
+static void
+WaitForProcSignalBarrierInternal(uint64 generation, bool receivedOnly)
 {
 	Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));
 
@@ -436,12 +441,17 @@ WaitForProcSignalBarrier(uint64 generation)
 		uint64		oldval;
 
 		/*
-		 * It's important that we check only pss_barrierGeneration here and
-		 * not pss_barrierCheckMask. Bits in pss_barrierCheckMask get cleared
-		 * before the barrier is actually absorbed, but pss_barrierGeneration
+		 * It's important that we check only pss_barrierGeneration &
+		 * pss_barrierGeneration here and not pss_barrierCheckMask. Bits in
+		 * pss_barrierCheckMask get cleared before the barrier is actually
+		 * absorbed, but pss_barrierGeneration & pss_barrierReceivedGeneration
 		 * is updated only afterward.
 		 */
-		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+		if (receivedOnly)
+			oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+		else
+			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
 		while (oldval < generation)
 		{
 			if (ConditionVariableTimedSleep(&slot->pss_barrierCV,
@@ -450,7 +460,11 @@ WaitForProcSignalBarrier(uint64 generation)
 				ereport(LOG,
 						(errmsg("still waiting for backend with PID %d to accept ProcSignalBarrier",
 								(int) pg_atomic_read_u32(&slot->pss_pid))));
-			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
+			if (receivedOnly)
+				oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+			else
+				oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		}
 		ConditionVariableCancelSleep();
 	}
@@ -464,12 +478,33 @@ WaitForProcSignalBarrier(uint64 generation)
 	 * The caller is probably calling this function because it wants to read
 	 * the shared state or perform further writes to shared state once all
 	 * backends are known to have absorbed the barrier. However, the read of
-	 * pss_barrierGeneration was performed unlocked; insert a memory barrier
-	 * to separate it from whatever follows.
+	 * pss_barrierGeneration & pss_barrierReceivedGeneration was performed
+	 * unlocked; insert a memory barrier to separate it from whatever follows.
 	 */
 	pg_memory_barrier();
 }
 
+/*
+ * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
+ * requested by a specific call to EmitProcSignalBarrier() have taken effect.
+ */
+void
+WaitForProcSignalBarrier(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, false);
+}
+
+/*
+ * WaitForProcSignalBarrierReceived - wait until it is guaranteed that all
+ * backends have observed the message sent by a specific call to
+ * EmitProcSignalBarrier().
+ */
+void
+WaitForProcSignalBarrierReceived(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, true);
+}
+
 /*
  * Handle receipt of an interrupt indicating a global barrier event.
  *
@@ -523,6 +558,10 @@ ProcessProcSignalBarrier(void)
 	if (local_gen == shared_gen)
 		return;
 
+	/* The message is observed, record that */
+	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
+						shared_gen);
+
 	/*
 	 * Get and clear the flags that are set for this backend. Note that
 	 * pg_atomic_exchange_u32 is a full barrier, so we're guaranteed that the
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index afeeb1ca019..2733bbb8c5b 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -79,6 +79,7 @@ extern void SendCancelRequest(int backendPID, const uint8 *cancel_key, int cance
 
 extern uint64 EmitProcSignalBarrier(ProcSignalBarrierType type);
 extern void WaitForProcSignalBarrier(uint64 generation);
+extern void WaitForProcSignalBarrierReceived(uint64 generation);
 extern void ProcessProcSignalBarrier(void);
 
 extern void procsignal_sigusr1_handler(SIGNAL_ARGS);
-- 
2.34.1

0004-Introduce-pending-flag-for-GUC-assign-hooks-20250610.patchtext/x-patch; charset=US-ASCII; name=0004-Introduce-pending-flag-for-GUC-assign-hooks-20250610.patchDownload
From 1afa2048d803e0b3372de348868b26542ccfb3cd Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH 04/17] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinating work
between backends to change, meaning we cannot apply it right away.

Add a new flag "pending" for an assign hook to allow the hook indicate
exactly that. If the pending flag is set after the hook, the new value
will not be applied and it's handling becomes the hook's implementation
responsibility.

Note that this also requires changes in the way GUCs are getting
reported, but the patch does not cover that yet.
---
 src/backend/access/transam/xlog.c    |  2 +-
 src/backend/commands/variable.c      |  6 +--
 src/backend/libpq/pqcomm.c           |  8 ++--
 src/backend/tcop/postgres.c          |  2 +-
 src/backend/utils/misc/guc.c         | 59 +++++++++++++++++++---------
 src/backend/utils/misc/stack_depth.c |  2 +-
 src/include/utils/guc.h              |  2 +-
 src/include/utils/guc_hooks.h        | 20 +++++-----
 8 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1914859b2ee..5e204341bde 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2321,7 +2321,7 @@ CalculateCheckpointSegments(void)
 }
 
 void
-assign_max_wal_size(int newval, void *extra)
+assign_max_wal_size(int newval, void *extra, bool *pending)
 {
 	max_wal_size_mb = newval;
 	CalculateCheckpointSegments();
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 608f10d9412..e40dae2ddf2 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1143,7 +1143,7 @@ check_cluster_name(char **newval, void **extra, GucSource source)
  * GUC assign_hook for maintenance_io_concurrency
  */
 void
-assign_maintenance_io_concurrency(int newval, void *extra)
+assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
 {
 	/*
 	 * Reconfigure recovery prefetching, because a setting it depends on
@@ -1161,12 +1161,12 @@ assign_maintenance_io_concurrency(int newval, void *extra)
  * they may be assigned in either order.
  */
 void
-assign_io_max_combine_limit(int newval, void *extra)
+assign_io_max_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit = Min(newval, io_combine_limit_guc);
 }
 void
-assign_io_combine_limit(int newval, void *extra)
+assign_io_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit = Min(io_max_combine_limit, newval);
 }
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index e5171467de1..2a6a587ef76 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1952,7 +1952,7 @@ pq_settcpusertimeout(int timeout, Port *port)
  * GUC assign_hook for tcp_keepalives_idle
  */
 void
-assign_tcp_keepalives_idle(int newval, void *extra)
+assign_tcp_keepalives_idle(int newval, void *extra, bool *pending)
 {
 	/*
 	 * The kernel API provides no way to test a value without setting it; and
@@ -1985,7 +1985,7 @@ show_tcp_keepalives_idle(void)
  * GUC assign_hook for tcp_keepalives_interval
  */
 void
-assign_tcp_keepalives_interval(int newval, void *extra)
+assign_tcp_keepalives_interval(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivesinterval(newval, MyProcPort);
@@ -2008,7 +2008,7 @@ show_tcp_keepalives_interval(void)
  * GUC assign_hook for tcp_keepalives_count
  */
 void
-assign_tcp_keepalives_count(int newval, void *extra)
+assign_tcp_keepalives_count(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivescount(newval, MyProcPort);
@@ -2031,7 +2031,7 @@ show_tcp_keepalives_count(void)
  * GUC assign_hook for tcp_user_timeout
  */
 void
-assign_tcp_user_timeout(int newval, void *extra)
+assign_tcp_user_timeout(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_settcpusertimeout(newval, MyProcPort);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2f8c3d5f918..0d1b6466d1e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3591,7 +3591,7 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 
 /* GUC assign hook for transaction_timeout */
 void
-assign_transaction_timeout(int newval, void *extra)
+assign_transaction_timeout(int newval, void *extra, bool *pending)
 {
 	if (IsTransactionState())
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 667df448732..bb681f5bc60 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1679,6 +1679,7 @@ InitializeOneGUCOption(struct config_generic *gconf)
 				struct config_int *conf = (struct config_int *) gconf;
 				int			newval = conf->boot_val;
 				void	   *extra = NULL;
+				bool 	   pending = false;
 
 				Assert(newval >= conf->min);
 				Assert(newval <= conf->max);
@@ -1687,9 +1688,13 @@ InitializeOneGUCOption(struct config_generic *gconf)
 					elog(FATAL, "failed to initialize %s to %d",
 						 conf->gen.name, newval);
 				if (conf->assign_hook)
-					conf->assign_hook(newval, extra);
-				*conf->variable = conf->reset_val = newval;
-				conf->gen.extra = conf->reset_extra = extra;
+					conf->assign_hook(newval, extra, &pending);
+
+				if (!pending)
+				{
+					*conf->variable = conf->reset_val = newval;
+					conf->gen.extra = conf->reset_extra = extra;
+				}
 				break;
 			}
 		case PGC_REAL:
@@ -2041,13 +2046,18 @@ ResetAllOptions(void)
 			case PGC_INT:
 				{
 					struct config_int *conf = (struct config_int *) gconf;
+					bool 			  pending = false;
 
 					if (conf->assign_hook)
 						conf->assign_hook(conf->reset_val,
-										  conf->reset_extra);
-					*conf->variable = conf->reset_val;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									conf->reset_extra);
+										  conf->reset_extra,
+										  &pending);
+					if (!pending)
+					{
+						*conf->variable = conf->reset_val;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										conf->reset_extra);
+					}
 					break;
 				}
 			case PGC_REAL:
@@ -2424,16 +2434,21 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
 							struct config_int *conf = (struct config_int *) gconf;
 							int			newval = newvalue.val.intval;
 							void	   *newextra = newvalue.extra;
+							bool 	    pending = false;
 
 							if (*conf->variable != newval ||
 								conf->gen.extra != newextra)
 							{
 								if (conf->assign_hook)
-									conf->assign_hook(newval, newextra);
-								*conf->variable = newval;
-								set_extra_field(&conf->gen, &conf->gen.extra,
-												newextra);
-								changed = true;
+									conf->assign_hook(newval, newextra, &pending);
+
+								if (!pending)
+								{
+									*conf->variable = newval;
+									set_extra_field(&conf->gen, &conf->gen.extra,
+													newextra);
+									changed = true;
+								}
 							}
 							break;
 						}
@@ -3850,18 +3865,24 @@ set_config_with_handle(const char *name, config_handle *handle,
 
 				if (changeVal)
 				{
+					bool pending = false;
+
 					/* Save old value to support transaction abort */
 					if (!makeDefault)
 						push_old_value(&conf->gen, action);
 
 					if (conf->assign_hook)
-						conf->assign_hook(newval, newextra);
-					*conf->variable = newval;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									newextra);
-					set_guc_source(&conf->gen, source);
-					conf->gen.scontext = context;
-					conf->gen.srole = srole;
+						conf->assign_hook(newval, newextra, &pending);
+
+					if (!pending)
+					{
+						*conf->variable = newval;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										newextra);
+						set_guc_source(&conf->gen, source);
+						conf->gen.scontext = context;
+						conf->gen.srole = srole;
+					}
 				}
 				if (makeDefault)
 				{
diff --git a/src/backend/utils/misc/stack_depth.c b/src/backend/utils/misc/stack_depth.c
index 8f7cf531fbc..ef59ae62008 100644
--- a/src/backend/utils/misc/stack_depth.c
+++ b/src/backend/utils/misc/stack_depth.c
@@ -156,7 +156,7 @@ check_max_stack_depth(int *newval, void **extra, GucSource source)
 
 /* GUC assign hook for max_stack_depth */
 void
-assign_max_stack_depth(int newval, void *extra)
+assign_max_stack_depth(int newval, void *extra, bool *pending)
 {
 	ssize_t		newval_bytes = newval * (ssize_t) 1024;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f619100467d..8802ad8a3cb 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -187,7 +187,7 @@ typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource sourc
 typedef bool (*GucEnumCheckHook) (int *newval, void **extra, GucSource source);
 
 typedef void (*GucBoolAssignHook) (bool newval, void *extra);
-typedef void (*GucIntAssignHook) (int newval, void *extra);
+typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);
 typedef void (*GucRealAssignHook) (double newval, void *extra);
 typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 typedef void (*GucEnumAssignHook) (int newval, void *extra);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 799fa7ace68..c8300cffa8e 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -81,14 +81,14 @@ extern bool check_log_stats(bool *newval, void **extra, GucSource source);
 extern bool check_log_timezone(char **newval, void **extra, GucSource source);
 extern void assign_log_timezone(const char *newval, void *extra);
 extern const char *show_log_timezone(void);
-extern void assign_maintenance_io_concurrency(int newval, void *extra);
-extern void assign_io_max_combine_limit(int newval, void *extra);
-extern void assign_io_combine_limit(int newval, void *extra);
+extern void assign_maintenance_io_concurrency(int newval, void *extra, bool *pending);
+extern void assign_io_max_combine_limit(int newval, void *extra, bool *pending);
+extern void assign_io_combine_limit(int newval, void *extra, bool *pending);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
 										 GucSource source);
-extern void assign_max_wal_size(int newval, void *extra);
+extern void assign_max_wal_size(int newval, void *extra, bool *pending);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
-extern void assign_max_stack_depth(int newval, void *extra);
+extern void assign_max_stack_depth(int newval, void *extra, bool *pending);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
@@ -143,13 +143,13 @@ extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 extern void assign_syslog_facility(int newval, void *extra);
 extern void assign_syslog_ident(const char *newval, void *extra);
-extern void assign_tcp_keepalives_count(int newval, void *extra);
+extern void assign_tcp_keepalives_count(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_count(void);
-extern void assign_tcp_keepalives_idle(int newval, void *extra);
+extern void assign_tcp_keepalives_idle(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_idle(void);
-extern void assign_tcp_keepalives_interval(int newval, void *extra);
+extern void assign_tcp_keepalives_interval(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_interval(void);
-extern void assign_tcp_user_timeout(int newval, void *extra);
+extern void assign_tcp_user_timeout(int newval, void *extra, bool *pending);
 extern const char *show_tcp_user_timeout(void);
 extern bool check_temp_buffers(int *newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra,
@@ -165,7 +165,7 @@ extern bool check_transaction_buffers(int *newval, void **extra, GucSource sourc
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
 extern bool check_transaction_read_only(bool *newval, void **extra, GucSource source);
-extern void assign_transaction_timeout(int newval, void *extra);
+extern void assign_transaction_timeout(int newval, void *extra, bool *pending);
 extern const char *show_unix_socket_permissions(void);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern bool check_wal_consistency_checking(char **newval, void **extra,
-- 
2.34.1

0001-Allow-to-use-multiple-shared-memory-mapping-20250610.patchtext/x-patch; charset=US-ASCII; name=0001-Allow-to-use-multiple-shared-memory-mapping-20250610.patchDownload
From 25a501f17a36523be0b133f992393433428d73c5 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 28 Feb 2025 19:54:47 +0100
Subject: [PATCH 01/17] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways in which shared memory can be organized.

Introduce the possibility of allocating multiple shared memory mappings, where
each mapping is associated with a specified shared memory segment. There is
only a fixed number of available segments; currently only the one main shared
memory segment is allocated. A new shared memory API is introduced, extended
with a segment as a new parameter. As the path of least resistance, the
original API is kept in place and redirected to the main shared memory
segment.
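
To illustrate the resulting API surface, here is a small usage sketch based on
the declarations added by this patch (the struct name below is made up; this
is not additional patch code). The original call keeps working and implicitly
targets the main segment, which is equivalent to passing MAIN_SHMEM_SEGMENT
explicitly:

    bool        found;
    void       *ptr;

    /* Original API: unchanged callers keep allocating from the main segment. */
    ptr = ShmemInitStruct("My Fixed Struct", 1024, &found);

    /* Equivalent explicit form introduced by this patch. */
    ptr = ShmemInitStructInSegment("My Fixed Struct", 1024, &found,
                                   MAIN_SHMEM_SEGMENT);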
---
 src/backend/port/posix_sema.c     |   4 +-
 src/backend/port/sysv_sema.c      |   4 +-
 src/backend/port/sysv_shmem.c     | 138 ++++++++++++++++++++---------
 src/backend/port/win32_sema.c     |   2 +-
 src/backend/storage/ipc/ipc.c     |   4 +-
 src/backend/storage/ipc/ipci.c    |  63 +++++++------
 src/backend/storage/ipc/shmem.c   | 141 +++++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c |  13 ++-
 src/include/storage/ipc.h         |   2 +-
 src/include/storage/pg_sema.h     |   2 +-
 src/include/storage/pg_shmem.h    |  18 ++++
 src/include/storage/shmem.h       |  12 +++
 12 files changed, 278 insertions(+), 125 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..401e1113fa1 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 #endif
 
 	numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 423b2b4f9d6..4ce2cfb662b 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -307,7 +307,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -328,7 +328,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 	numSharedSemas = 0;
 	maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..56af0231d24 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+	int shmem_segment;
+	Size shmem_size; 			/* Size of the mapping */
+	Pointer shmem; 				/* Pointer to the start of the mapped memory */
+	Pointer seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping segments */
+static int next_free_segment = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_segment)
+{
+	switch (shmem_segment)
+	{
+		case MAIN_SHMEM_SEGMENT:
+			return "main";
+		default:
+			return "unknown";
+	}
+}
+
+static void
+DebugMappings()
+{
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		AnonymousMapping m = Mappings[i];
+		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+			 MappingName(i), m.shmem, m.shmem_size);
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-	Size		allocsize = *size;
+	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
 				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		{
+			DebugMappings();
+			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
+				 MappingName(mapping->shmem_segment), allocsize);
+		}
 	}
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = *size;
+		allocsize = mapping->shmem_size;
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
 	if (ptr == MAP_FAILED)
 	{
 		errno = mmap_errno;
+		DebugMappings();
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
-	*size = allocsize;
-	return ptr;
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		AnonymousMapping m = Mappings[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	mapping->shmem_size = size;
+	mapping->shmem_segment = next_free_segment;
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping);
+
+		next_free_segment++;
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + next_free_segment;
 
 	for (;;)
 	{
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	mapping->seg_addr = memAddress;
+	mapping->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (mapping->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		AnonymousMapping m = Mappings[i];
+
+		if (m.seg_addr != NULL)
+		{
+			if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+			m.seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index 5854ad1f54d..e7365ff8060 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 567739b5be9..5b55bec8d9d 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -61,6 +61,8 @@ static void proc_exit_prepare(int code);
  * but provide some additional features we need --- in particular,
  * we want to register callbacks to invoke when we are disconnecting
  * from a broken shared-memory context but not exiting the postmaster.
+ * The maximum number of such exit callbacks depends on the number of shared
+ * memory segments.
  *
  * Callback functions can take zero, one, or two args: the first passed
  * arg is the integer exitcode, the second is the Datum supplied when
@@ -68,7 +70,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..8b38e985327 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -86,7 +86,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_segment)
 {
 	Size		size;
 	int			numSemas;
@@ -206,33 +206,38 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize(&numSemas);
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
-
-	/*
-	 * Create semaphores
-	 */
-	PGReserveSemaphores(numSemas);
-
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		/* Compute the size of the shared-memory block */
+		size = CalculateShmemSize(&numSemas, segment);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+		/*
+		 * Create the shmem segment.
+		 *
+		 * XXX: Are multiple shims needed, one per segment?
+		 */
+		seghdr = PGSharedMemoryCreate(size, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSegment(seghdr, segment);
+
+		/*
+		 * Create semaphores
+		 */
+		PGReserveSemaphores(numSemas, segment);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSegment(segment);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -363,7 +368,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas);
+	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c9ae3b45b76..7e1a9b43fae 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -76,19 +76,19 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
+								 int shmem_segment);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
+ShmemSegment Segments[ANON_MAPPINGS];
 
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/*
+ * Primary index hashtable for shmem; for simplicity we use a single one for
+ * all shared memory segments. There can be performance consequences of that, and
+ * an alternative option would be to have one index per shared memory segment.
+ */
+static HTAB *ShmemIndex = NULL;
 
 /* To get reliable results for NUMA inquiry we need to "touch pages" once */
 static bool firstNumaTouch = true;
@@ -101,9 +101,17 @@ Datum		pg_numa_available(PG_FUNCTION_ARGS);
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-	ShmemSegHdr = seghdr;
-	ShmemBase = seghdr;
-	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+	InitShmemAccessInSegment(seghdr, MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAccessInSegment(PGShmemHeader *seghdr, int shmem_segment)
+{
+	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	ShmemSegment *seg = &Segments[shmem_segment];
+	seg->ShmemSegHdr = shmhdr;
+	seg->ShmemBase = (void *) shmhdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -114,7 +122,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSegment(MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAllocationInSegment(int shmem_segment)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_segment].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -123,9 +137,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_segment].ShmemLock = (slock_t *) ShmemAllocUnlockedInSegment(sizeof(slock_t), shmem_segment);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_segment].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -150,11 +164,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocInSegment(Size size, int shmem_segment)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -184,6 +204,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSegment(size, allocated_size, MAIN_SHMEM_SEGMENT);
+}
+
+static void *
+ShmemAllocRawInSegment(Size size, Size *allocated_size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -203,22 +229,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= Segments[shmem_segment].ShmemSegHdr->totalsize)
 	{
-		newSpace = (char *) ShmemBase + newStart;
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
+		Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -236,6 +262,12 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+	return ShmemAllocUnlockedInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -246,19 +278,19 @@ ShmemAllocUnlocked(Size size)
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory (%zu bytes requested)",
 						size)));
-	ShmemSegHdr->freeoffset = newFree;
+	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (char *) ShmemBase + newStart;
+	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -273,7 +305,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSegment(addr, MAIN_SHMEM_SEGMENT);
+}
+
+bool
+ShmemAddrIsValidInSegment(const void *addr, int shmem_segment)
+{
+	return (addr >= Segments[shmem_segment].ShmemBase) && (addr < Segments[shmem_segment].ShmemEnd);
 }
 
 /*
@@ -334,6 +372,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSegment(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SEGMENT);
+}
+
+HTAB *
+ShmemInitHashInSegment(const char *name,		/* table string name for shmem index */
+			  long init_size,		/* initial table size */
+			  long max_size,		/* max size of the table */
+			  HASHCTL *infoP,		/* info about key and bucket size */
+			  int hash_flags,		/* info about infoP */
+			  int shmem_segment) 	/* in which segment to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -350,9 +400,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSegment(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_segment);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -385,6 +435,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSegment(name, size, foundPtr, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
+					  int shmem_segment)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -393,7 +450,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_segment].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -416,7 +473,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSegment(size, shmem_segment);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -433,8 +490,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in segment %d",
+						name, shmem_segment)));
 	}
 
 	if (*foundPtr)
@@ -459,7 +516,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -478,14 +535,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -542,10 +598,11 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
+	/* XXX: take all shared memory segments into account. */
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
 		values[2] = Int64GetDatum(ent->size);
 		values[3] = Int64GetDatum(ent->allocated_size);
 		named_allocated += ent->allocated_size;
@@ -557,15 +614,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	/* output shared memory allocated but not counted via the shmem index */
 	values[0] = CStringGetTextDatum("<anonymous>");
 	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 	/* output as-of-yet unused shared memory */
 	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46f44bc4511..a36b08895c8 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -80,6 +80,8 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
+#include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/procnumber.h"
@@ -618,10 +620,15 @@ LWLockNewTrancheId(void)
 	int		   *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
-	/* We use the ShmemLock spinlock to protect LWLockCounter */
-	SpinLockAcquire(ShmemLock);
+	/*
+	 * We use the ShmemLock spinlock to protect LWLockCounter.
+	 *
+	 * XXX: Looks like this is the only use of Segments outside of shmem.c;
+	 * it may be worth reshaping this part to hide the Segments structure.
+	 */
+	SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 	result = (*LWLockCounter)++;
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	return result;
 }
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 3baf418b3d1..6ebda479ced 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index fa6ca35a51f..8ae9637fcd0 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_segment);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 5f7d4b83a60..2348c59b5a0 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+} ShmemSegment;
+
+/* Number of available segments for anonymous memory mappings */
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -91,4 +106,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main segment, contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SEGMENT 0
+
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index c1f668ded95..69663d412c3 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -29,15 +29,27 @@
 extern PGDLLIMPORT slock_t *ShmemLock;
 struct PGShmemHeader;			/* avoid including storage/pg_shmem.h here */
 extern void InitShmemAccess(struct PGShmemHeader *seghdr);
+extern void InitShmemAccessInSegment(struct PGShmemHeader *seghdr,
+									 int shmem_segment);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSegment(int shmem_segment);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSegment(Size size, int shmem_segment);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSegment(Size size, int shmem_segment);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSegment(const void *addr, int shmem_segment);
 extern void InitShmemIndex(void);
+extern void InitVariableShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
+									long max_size, HASHCTL *infoP,
+									int hash_flags, int shmem_segment);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSegment(const char *name, Size size,
+									  bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 

base-commit: 3feff3916ee106c084eca848527dc2d2c3ef4e89
-- 
2.34.1

Attachment: 0002-Address-space-reservation-for-shared-memory-20250610.patch (text/x-patch)
From acb2308031b68862a2f1238bf7ac803210063f71 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:21:33 +0200
Subject: [PATCH 02/17] Address space reservation for shared memory

Currently the kernel is responsible for choosing an address at which to place
each shared memory mapping, which is the lowest possible address that does not
clash with any other mapping. This is considered the most portable approach,
but one of the downsides is that there is no room left to resize the allocated
mappings. Here is how it looks for one mapping in /proc/$PID/maps, where
/dev/zero represents the anonymous shared memory we are talking about:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
    ...
    7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
    7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)

By specifying the mapping address directly it's possible to place the
mapping in a way that leaves room for resizing. The idea is:

* To reserve some address space via mmap'ing a large chunk of memory
  with PROT_NONE and MAP_NORESERVE. This way we prepare a playground for
  laying out the shared memory without risking anything interfering
  with it.

* To slice the reserved space up into sections, one for each shared
  segment.

* To allocate shared memory segments out of the corresponding slices,
  leaving unclaimed space in between them. This is implemented via
  mmap'ing memory at a specified address within the reserved space with
  MAP_FIXED.

The result looks like this:

    012d9000-0133e000         [heap]
    7f443a800000-7f444196c000 /dev/zero (deleted)
    7f444196c000-7f470a800000                     # reserved space
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2

Things like address space randomization should not be a problem in this
context, since the randomization is applied to the mmap base, which is
chosen once per process.

This approach also does not impact the actual memory usage as reported by
the kernel. Here is the output of /proc/$PID/status for the master
version with shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages
    // mapped in mm_struct. It corresponds to the mapped reserved space
    // and is the only number that grows with it.
    VmPeak:          2043192 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             22908 kB
    // Size of resident anonymous memory
    RssAnon:             768 kB
    // Size of resident file mappings
    RssFile:           10364 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          11776 kB

Here is the same for the patch when reserving 20GB of space:

    VmPeak:         21250648 kB
    VmRSS:             22948 kB
    RssAnon:             768 kB
    RssFile:           10404 kB
    RssShmem:          11776 kB

Cgroup v2 doesn't have any problems with that either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16.6 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    17637376 (~16.8 MB)

To control the amount of space reserved, a new GUC max_available_memory
is introduced. Ideally it should be based on the maximum available
memory, hence the name.
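
To make the layout idea easier to follow, here is a minimal standalone sketch
of the reserve-then-place approach (an illustration only, not the patch code;
the sizes are arbitrary). It reserves a large range with PROT_NONE and
MAP_NORESERVE, then carves an actual read/write segment out of it with
MAP_FIXED, leaving the rest of the range available to grow into:

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        size_t  reserve = (size_t) 4 * 1024 * 1024 * 1024; /* 4 GB of address space */
        size_t  segment = (size_t) 128 * 1024 * 1024;      /* 128 MB actually mapped */
        char   *base;
        char   *seg;

        /* Reserve address space only; nothing is committed or accounted yet. */
        base = mmap(NULL, reserve, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Place a usable shared segment at the start of the reservation. */
        seg = mmap(base, segment, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (seg == MAP_FAILED)
            return 1;

        printf("reserved %zu bytes at %p, segment of %zu bytes at %p\n",
               reserve, (void *) base, segment, (void *) seg);
        return 0;
    }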
---
 src/backend/port/sysv_shmem.c       | 284 ++++++++++++++++++++++++----
 src/backend/port/win32_shmem.c      |   2 +-
 src/backend/storage/ipc/ipci.c      |   5 +-
 src/backend/utils/init/globals.c    |   1 +
 src/backend/utils/misc/guc_tables.c |  14 ++
 src/include/storage/pg_shmem.h      |   4 +-
 6 files changed, 271 insertions(+), 39 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 56af0231d24..a0f03ff868f 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -108,6 +108,66 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
+/*
+ * Anonymous mapping placing (/dev/zero (deleted) below) looks like this:
+ *
+ * 00400000-00490000         /path/bin/postgres
+ * ...
+ * 012d9000-0133e000         [heap]
+ * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ * 7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
+ * 7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)
+ * ...
+ *
+ * We would like to place multiple mappings in such a way that there is
+ * enough space between them in the address space to resize up to a
+ * certain size, but without counting towards the total memory consumption.
+ *
+ * To achieve that we first reserve some shared memory address space by
+ * mmap'ing a segment of MaxAvailableMemory size with PROT_NONE and
+ * MAP_NORESERVE (these flags make sure this space will not be used by
+ * anything else, without counting against memory limits). Having the reserved
+ * space, we allocate actual chunks of shared memory out of it as usual,
+ * updating a pointer to the currently available reserved space for the next
+ * allocation, with the gap between segments in mind.
+ *
+ * The result would look like this:
+ *
+ * 012d9000-0133e000         [heap]
+ * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f442e010000-7f443a800000                     # reserved empty space
+ * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f444196c000-7f470a800000                     # reserved empty space
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * [...]
+ *
+ * The reserved space pointer is calculated to slice up the total reserved
+ * space into fixed fractions of address space for each segment, as specified
+ * in the SHMEM_RESIZE_RATIO array.
+ */
+static double SHMEM_RESIZE_RATIO[1] = {
+	1.0, 									/* MAIN_SHMEM_SLOT */
+};
+
+/*
+ * Offset from the beginning of the reserved space, which indicates the
+ * currently available range. New shared memory segments have to be allocated
+ * at this offset relative to the start of the reserved space.
+ */
+static Size reserved_offset = 0;
+
+/*
+ * Flag telling that we have decided to use huge pages.
+ *
+ * XXX: It's possible to use GetConfigOption("huge_pages_status", false, false)
+ * instead, but it feels like an overkill.
+ */
+static bool huge_pages_on = false;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -626,39 +686,198 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
  *
  * This function will modify mapping size to the actual size of the allocation,
  * if it ends up allocating a segment that is larger than requested.
+ *
+ * Note that we do not switch from huge pages to regular pages in this
+ * function; that decision was already made in ReserveAnonymousMemory and we
+ * stick to it.
  */
 static void
-CreateAnonymousSegment(AnonymousMapping *mapping)
+CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 {
 	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
+	int			mmap_flags = PG_MMAP_FLAGS;
 
 #ifndef MAP_HUGETLB
-	/* PGSharedMemoryCreate should have dealt with this case */
-	Assert(huge_pages != HUGE_PAGES_ON);
+	/* ReserveAnonymousMemory should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
 #else
-	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	if (huge_pages_on)
 	{
-		/*
-		 * Round up the request size to a suitable large value.
-		 */
 		Size		hugepagesize;
-		int			mmap_flags;
 
+		/* Make sure nothing is messed up */
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+		/* Round up the request size to a suitable large value */
 		GetHugePageSize(&hugepagesize, &mmap_flags);
 
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
+		mmap_flags = PG_MMAP_FLAGS | mmap_flags;
+	}
+#endif
+
+	elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p",
+		 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
+
+	/*
+	 * Try to create the mapping at an address within the reserved range, which
+	 * will allow extending it later. Use reserved_offset to allocate the
+	 * segment, then update the currently available reserved range.
+	 *
+	 * If this step fails, fall back to regular mapping creation and
+	 * signal that shared buffers cannot be resized without
+	 * a restart.
+	 */
+	ptr = mmap(base + reserved_offset, allocsize, PROT_READ | PROT_WRITE,
+			   mmap_flags | MAP_FIXED, -1, 0);
+	mmap_errno = errno;
+
+	if (ptr == MAP_FAILED)
+	{
+		DebugMappings();
+		elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p failed: %m, "
+					 "falling back to a non-resizable allocation",
+			 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
+						   PG_MMAP_FLAGS, -1, 0);
+		mmap_errno = errno;
+	}
+	else
+	{
+		Size total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
+
+		reserved_offset += total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
+	}
+
+	if (ptr == MAP_FAILED)
+	{
+		errno = mmap_errno;
+		DebugMappings();
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
+				 (mmap_errno == ENOMEM) ?
+				 errhint("This error usually means that PostgreSQL's request "
+						 "for a shared memory segment exceeded available memory, "
+						 "swap space, or huge pages. To reduce the request size "
+						 "(currently %zu bytes), reduce PostgreSQL's shared "
+						 "memory usage, perhaps by reducing \"shared_buffers\" or "
+						 "\"max_connections\".",
+						 allocsize) : 0));
+	}
+
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
+}
+
+/*
+ * ReserveAnonymousMemory
+ *
+ * Reserve shared memory address space, from which shared memory segments are
+ * going to be sliced out. The goal of this exercise is to support segment
+ * resizing, for which we need a reserved space free of potential clashes with
+ * other mmap'd areas that are not under our control. Reservation is done via
+ * mmap and will not allocate any memory until it is actually used;
+ * MAP_NORESERVE keeps it from counting against kernel reservation
+ * limits (e.g. in cgroups or for huge pages). Do not get confused by
+ * MAP_NORESERVE -- we need to reserve address space, but not the actual memory,
+ * and that is what this flag is about.
+ *
+ * Note that with MAP_NORESERVE a hugetlb reservation will succeed even
+ * if there are actually not enough huge pages. Hence this function is
+ * responsible for deciding whether to use huge pages or not. To achieve that
+ * we probe first, trying to allocate the memory needed for all segments --
+ * if this succeeds, we unmap the probe segment and use hugetlb; if it fails,
+ * we proceed with regular memory.
+ */
+void *
+ReserveAnonymousMemory(Size reserve_size)
+{
+	Size		allocsize = reserve_size;
+	void	   *ptr = MAP_FAILED;
+	int			mmap_errno = 0;
+
+	/* Complain if hugepages demanded but we can't possibly support them */
+#if !defined(MAP_HUGETLB)
+	if (huge_pages == HUGE_PAGES_ON)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("huge pages not supported on this platform")));
+#else
+	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	{
+		Size		hugepagesize, total_size = 0;
+		int			mmap_flags;
+
+		GetHugePageSize(&hugepagesize, &mmap_flags);
+
+		/*
+		 * Figure out how much memory is needed for all segments, keeping in
+		 * mind that for every segment this value will be rounded up to the
+		 * huge page size. The resulting value will be used to probe memory and
+		 * decide whether we will allocate huge pages or not.
+		 *
+		 * We could actually have a mix of segments with and without
+		 * huge pages. But in that case we would need multiple reservation
+		 * spaces for the corresponding memory (hugetlb address space reserved
+		 * for hugetlb segments, regular memory for others), and it doesn't
+		 * seem worth the complexity for now.
+		 */
+		for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+		{
+			int	numSemas;
+			Size segment_size = CalculateShmemSize(&numSemas, segment);
+
+			if (segment_size % hugepagesize != 0)
+				segment_size += hugepagesize - (segment_size % hugepagesize);
+
+			total_size += segment_size;
+		}
+
+		/* Map total amount of memory to test its availability. */
+		elog(DEBUG1, "reserving space: probe mmap(%zu) with MAP_HUGETLB",
+					 total_size);
+		ptr = mmap(NULL, total_size, PROT_NONE,
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
 		{
-			DebugMappings();
-			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 MappingName(mapping->shmem_segment), allocsize);
+			/* No huge pages, we will go with the regular page size */
+			elog(DEBUG1, "reserving space: probe mmap(%zu) with MAP_HUGETLB "
+						 "failed, huge pages disabled: %m", total_size);
+		}
+		else
+		{
+			/*
+			 * All fine, unmap the temporary segment and proceed with reserving
+			 * using huge pages.
+			 */
+			if (munmap(ptr, total_size) < 0)
+				elog(LOG, "reserving space: munmap(%p, %zu) failed: %m",
+					 ptr, total_size);
+
+			/* Round up the requested size to a suitable large value. */
+			if (allocsize % hugepagesize != 0)
+				allocsize += hugepagesize - (allocsize % hugepagesize);
+
+			elog(DEBUG1, "reserving space: mmap(%zu) with MAP_HUGETLB",
+						 allocsize);
+			ptr = mmap(NULL, allocsize, PROT_NONE,
+					   PG_MMAP_FLAGS | MAP_ANONYMOUS | MAP_NORESERVE | mmap_flags,
+					   -1, 0);
+			mmap_errno = errno;
+
+			/* This should not happen, but handle errors anyway */
+			if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
+			{
+				elog(DEBUG1, "reserving space: mmap(%zu) with MAP_HUGETLB "
+							 "failed, huge pages disabled: %m", allocsize);
+			}
 		}
 	}
 #endif
@@ -666,10 +885,12 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 	/*
 	 * Report whether huge pages are in use.  This needs to be tracked before
 	 * the second mmap() call if attempting to use huge pages failed
-	 * previously.
+	 * previously. At this point ptr is either pointing to the probe segment,
+	 * if we couldn't mmap it, or the reservation space.
 	 */
 	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
 					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	huge_pages_on = ptr != MAP_FAILED;
 
 	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
 	{
@@ -677,10 +898,11 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = mapping->shmem_size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
-		mmap_errno = errno;
+		allocsize = reserve_size;
+
+		elog(DEBUG1, "reserving space: mmap(%zu)", allocsize);
+		ptr = mmap(NULL, allocsize, PROT_NONE,
+				   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
 	}
 
 	if (ptr == MAP_FAILED)
@@ -688,20 +910,18 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		errno = mmap_errno;
 		DebugMappings();
 		ereport(FATAL,
-				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
-						MappingName(mapping->shmem_segment)),
+				(errmsg("reserving space: could not map anonymous shared "
+						"memory: %m"),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
-						 "for a shared memory segment exceeded available memory, "
-						 "swap space, or huge pages. To reduce the request size "
-						 "(currently %zu bytes), reduce PostgreSQL's shared "
-						 "memory usage, perhaps by reducing \"shared_buffers\" or "
-						 "\"max_connections\".",
+						 "for a reserved shared memory address space exceeded "
+						 "available memory, swap space, or huge pages. To "
+						 "reduce the request reservation size (currently %zu "
+						 "bytes), reduce PostgreSQL's \"maximum_shared_buffers\".",
 						 allocsize) : 0));
 	}
 
-	mapping->shmem = ptr;
-	mapping->shmem_size = allocsize;
+	return ptr;
 }
 
 /*
@@ -740,7 +960,7 @@ AnonymousShmemDetach(int status, Datum arg)
  */
 PGShmemHeader *
 PGSharedMemoryCreate(Size size,
-					 PGShmemHeader **shim)
+					 PGShmemHeader **shim, Pointer base)
 {
 	IpcMemoryKey NextShmemSegID;
 	void	   *memAddress;
@@ -760,14 +980,6 @@ PGSharedMemoryCreate(Size size,
 				 errmsg("could not stat data directory \"%s\": %m",
 						DataDir)));
 
-	/* Complain if hugepages demanded but we can't possibly support them */
-#if !defined(MAP_HUGETLB)
-	if (huge_pages == HUGE_PAGES_ON)
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("huge pages not supported on this platform")));
-#endif
-
 	/* For now, we don't support huge pages in SysV memory */
 	if (huge_pages == HUGE_PAGES_ON && shared_memory_type != SHMEM_TYPE_MMAP)
 		ereport(ERROR,
@@ -782,7 +994,7 @@ PGSharedMemoryCreate(Size size,
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
 		/* On success, mapping data will be modified. */
-		CreateAnonymousSegment(mapping);
+		CreateAnonymousSegment(mapping, base);
 
 		next_free_segment++;
 
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 4dee856d6bd..ce719f1b412 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -205,7 +205,7 @@ EnableLockPagesPrivilege(int elevel)
  */
 PGShmemHeader *
 PGSharedMemoryCreate(Size size,
-					 PGShmemHeader **shim)
+					 PGShmemHeader **shim, Pointer base)
 {
 	void	   *memAddress;
 	PGShmemHeader *hdr;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8b38e985327..076888c0172 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -203,9 +203,12 @@ CreateSharedMemoryAndSemaphores(void)
 	PGShmemHeader *seghdr;
 	Size		size;
 	int			numSemas;
+	void 		*base;
 
 	Assert(!IsUnderPostmaster);
 
+	base = ReserveAnonymousMemory((Size) MaxAvailableMemory * BLCKSZ);
+
 	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
 	{
 		/* Compute the size of the shared-memory block */
@@ -217,7 +220,7 @@ CreateSharedMemoryAndSemaphores(void)
 		 *
 		 * XXX: Are multiple shims needed, one per segment?
 		 */
-		seghdr = PGSharedMemoryCreate(size, &shim);
+		seghdr = PGSharedMemoryCreate(size, &shim, base);
 
 		/*
 		 * Make sure that huge pages are never reported as "unknown" while the
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..9ccb7d455d6 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -140,6 +140,7 @@ int			max_parallel_maintenance_workers = 2;
  * register background workers.
  */
 int			NBuffers = 16384;
+int			MaxAvailableMemory = 131072;
 int			MaxConnections = 100;
 int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f04bfedb2fd..e63521e5a2d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2376,6 +2376,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_available_memory", PGC_SIGHUP, RESOURCES_MEM,
+			gettext_noop("Sets the upper limit for the shared_buffers value."),
+			gettext_noop("Shared memory can be resized at runtime; this "
+						 "parameter sets the upper limit for it, beyond which "
+						 "resizing is not supported. Normally this value "
+						 "would be the same as the total available memory."),
+			GUC_UNIT_BLOCKS
+		},
+		&MaxAvailableMemory,
+		131072, 16, INT_MAX / 2,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_buffer_usage_limit", PGC_USERSET, RESOURCES_MEM,
 			gettext_noop("Sets the buffer pool size for VACUUM, ANALYZE, and autovacuum."),
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 2348c59b5a0..8cb1e159917 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
 extern PGDLLIMPORT int huge_pages_status;
+extern PGDLLIMPORT int MaxAvailableMemory;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
@@ -101,10 +102,11 @@ extern void PGSharedMemoryNoReAttach(void);
 #endif
 
 extern PGShmemHeader *PGSharedMemoryCreate(Size size,
-										   PGShmemHeader **shim);
+										   PGShmemHeader **shim, Pointer base);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
+void *ReserveAnonymousMemory(Size reserve_size);
 
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
-- 
2.34.1

Attachment: 0003-Introduce-multiple-shmem-segments-for-share-20250610.patch (text/x-patch)
From 92e4e639547207e883bc010c485177afa32d5c72 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 15 Mar 2025 16:38:59 +0100
Subject: [PATCH 03/17] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into the following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers,
meaning that if we want to change NBuffers, these structures have to be
resized correspondingly. Placing each of them in a separate shmem
segment makes that possible.

There are some assumptions made about the upper size limit of each shmem
segment. The buffer blocks have the largest, while the rest claim less
extra room for resizing. Ideally those limits should be deduced from the
maximum allowed shared memory.
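
For a rough sense of what these limits mean in practice, here is an
illustrative calculation (the 16 GB reservation is an arbitrary assumption,
and the fractions are the SHMEM_RESIZE_RATIO values used further down in this
patch): with 16 GB of reserved address space, the per-segment room to grow
into would be approximately

    main         0.10  ->  1.6 GB
    buffers      0.60  ->  9.6 GB
    descriptors  0.10  ->  1.6 GB
    iocv         0.10  ->  1.6 GB
    checkpoint   0.05  ->  0.8 GB
    strategy     0.05  ->  0.8 GB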
---
 src/backend/port/sysv_shmem.c          | 24 +++++++-
 src/backend/storage/buffer/buf_init.c  | 79 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  6 +-
 src/backend/storage/buffer/freelist.c  |  5 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 24 +++++++-
 7 files changed, 105 insertions(+), 37 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index a0f03ff868f..f46d9d5d9cd 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -147,10 +147,18 @@ static int next_free_segment = 0;
  *
  * The reserved space pointer is calculated to slice up the total reserved
  * space into fixed fractions of address space for each segment, as specified
- * in the SHMEM_RESIZE_RATIO array.
+ * in the SHMEM_RESIZE_RATIO array. E.g. we allow BUFFERS_SHMEM_SEGMENT to take
+ * up to 60% of the whole space when resizing, based on the fact that it most
+ * likely will be the main consumer of this memory. These numbers are pulled
+ * out of thin air for now; it makes sense to evaluate them more precisely.
  */
-static double SHMEM_RESIZE_RATIO[1] = {
-	1.0, 									/* MAIN_SHMEM_SLOT */
+static double SHMEM_RESIZE_RATIO[6] = {
+	0.1,    /* MAIN_SHMEM_SEGMENT */
+	0.6,    /* BUFFERS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_IOCV_SHMEM_SEGMENT */
+	0.05,   /* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
+	0.05,   /* STRATEGY_SHMEM_SEGMENT */
 };
 
 /*
@@ -182,6 +190,16 @@ MappingName(int shmem_segment)
 	{
 		case MAIN_SHMEM_SEGMENT:
 			return "main";
+		case BUFFERS_SHMEM_SEGMENT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SEGMENT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SEGMENT:
+			return "strategy";
 		default:
 			return "unknown";
 	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..bd68b69ee98 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -62,7 +62,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). The size of the data structures
+ * initialized here depends on NBuffers, and to be able to change NBuffers
+ * without a restart we store each structure in a separate shared memory
+ * segment, which can be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -74,22 +77,22 @@ BufferManagerShmemInit(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSegment("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSegment("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -99,8 +102,9 @@ BufferManagerShmemInit(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -156,33 +160,54 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc. based on the
+ * shared memory segment. The main segment must not allocate anything
+ * related to buffers, every other segment will receive part of the
+ * data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(int shmem_segment)
 {
 	Size		size = 0;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == MAIN_SHMEM_SEGMENT)
+		return size;
 
-	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
+	{
+		/* size of buffer descriptors */
+		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+		/* to allow aligning buffer descriptors */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of data pages, plus alignment padding */
+		size = add_size(size, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	}
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
-								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
+	{
+		/* size of stuff controlled by freelist.c */
+		size = add_size(size, StrategyShmemSize());
+	}
 
-	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
+	{
+		/* size of I/O condition variables */
+		size = add_size(size, mul_size(NBuffers,
+									   sizeof(ConditionVariableMinimallyPadded)));
+		/* to allow aligning the above */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
+
+	if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of checkpoint sort array in bufmgr.c */
+		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	}
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index a50955d5286..a9952b36eba 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -22,6 +22,7 @@
 #include "postgres.h"
 
 #include "storage/buf_internals.h"
+#include "storage/pg_shmem.h"
 
 /* entry for buffer lookup hashtable */
 typedef struct
@@ -59,10 +60,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION,
+								  STRATEGY_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..bd390f2709d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 
 #define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
@@ -491,9 +492,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SEGMENT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 076888c0172..9d00b80b4f8 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -113,7 +113,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize());
+	size = add_size(size, BufferManagerShmemSize(shmem_segment));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41fdc1e7693..edac9db6a12 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,7 +318,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 8cb1e159917..f8459a5a421 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -108,7 +108,29 @@ extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 void *ReserveAnonymousMemory(Size reserve_size);
 
+/*
+ * To be able to dynamically resize largest parts of the data stored in shared
+ * memory, we split it into multiple shared memory mappings segments. Each
+ * segment contains only certain part of the data, which size depends on
+ * NBuffers.
+ */
+
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SEGMENT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SEGMENT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SEGMENT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SEGMENT 5
+
 #endif							/* PG_SHMEM_H */
-- 
2.34.1

Attachment: 0007-Use-anonymous-files-to-back-shared-memory-s-20250610.patch (text/x-patch)
From 441f537b64b6bc8f0f00fa0de7850911acff621c Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 15 Mar 2025 16:39:45 +0100
Subject: [PATCH 07/17] Use anonymous files to back shared memory segments

Allow the use of anonymous files for shared memory, instead of plain
anonymous memory. Such an anonymous file is created via memfd_create; it
lives in memory, behaves like a regular file, and is semantically
equivalent to anonymous memory allocated via mmap with MAP_ANONYMOUS.

The advantages of using anon files are the following:

* We've got a file descriptor, which could be used for regular file
  operations (modification, truncation, you name it).

* The file could be given a name, which improves readability when it
  comes to process maps. Here is how it looks:

7f90cde00000-7f90d5126000 rw-s 00000000 00:01 5463 /memfd:main (deleted)
7f90d5126000-7f914de00000 ---p 00000000 00:00 0
7f914de00000-7f9175128000 rw-s 00000000 00:01 5466 /memfd:buffers (deleted)
7f9175128000-7f944de00000 ---p 00000000 00:00 0
7f944de00000-7f9455528000 rw-s 00000000 00:01 5469 /memfd:descriptors (deleted)
7f9455528000-7f94cde00000 ---p 00000000 00:00 0
7f94cde00000-7f94d5228000 rw-s 00000000 00:01 5472 /memfd:iocv (deleted)
7f94d5228000-7f954de00000 ---p 00000000 00:00 0
7f954de00000-7f9555266000 rw-s 00000000 00:01 5475 /memfd:checkpoint (deleted)
7f9555266000-7f958de00000 ---p 00000000 00:00 0
7f958de00000-7f95954aa000 rw-s 00000000 00:01 5478 /memfd:strategy (deleted)
7f95954aa000-7f95cde00000 ---p 00000000 00:00 0

* By default, Linux will not include file-backed shared mappings in a core dump,
  making it more convenient to work with them in PostgreSQL: no more huge dumps
  to process.

The downside is that memfd_create is Linux specific.
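
For reference, here is a self-contained sketch of the memfd_create + ftruncate
+ mmap combination described above. It is a standalone demo rather than code
from the patch, and assumes Linux with a glibc that exposes memfd_create:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t size = 16 * 1024 * 1024;
        int    fd;
        char  *p;

        /* In-memory anonymous file; the name shows up in /proc/<pid>/maps. */
        fd = memfd_create("buffers", 0);
        if (fd == -1)
        {
            perror("memfd_create");
            return 1;
        }

        /* The file starts empty, so give it a size before mapping it. */
        if (ftruncate(fd, size) == -1)
        {
            perror("ftruncate");
            return 1;
        }

        /* Shared mapping backed by the anon file instead of MAP_ANONYMOUS. */
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        strcpy(p, "hello");
        printf("mapped %zu bytes at %p: %s\n", size, (void *) p, p);
        return 0;
    }
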
---
 src/backend/port/sysv_shmem.c  | 73 +++++++++++++++++++++++++++++-----
 src/backend/port/win32_shmem.c |  2 +-
 src/backend/storage/ipc/ipci.c |  2 +-
 src/include/portability/mem.h  |  2 +-
 src/include/storage/pg_shmem.h |  3 +-
 5 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index a3437973784..87000a24eea 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -107,6 +107,7 @@ typedef struct AnonymousMapping
 	Pointer shmem; 				/* Pointer to the start of the mapped memory */
 	Pointer seg_addr; 			/* SysV shared memory for the header */
 	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -127,7 +128,7 @@ static int next_free_segment = 0;
  * 00400000-00490000         /path/bin/postgres
  * ...
  * 012d9000-0133e000         [heap]
- * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f443a800000-7f470a800000 /memfd:main (deleted)
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
@@ -150,9 +151,9 @@ static int next_free_segment = 0;
  * The result would look like this:
  *
  * 012d9000-0133e000         [heap]
- * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f4426f54000-7f442e010000 /memfd:main (deleted)
  * 7f442e010000-7f443a800000                     # reserved empty space
- * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f443a800000-7f444196c000 /memfd:buffers (deleted)
  * 7f444196c000-7f470a800000                     # reserved empty space
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
@@ -643,13 +644,14 @@ PGSharedMemoryAttach(IpcMemoryId shmId,
  * *hugepagesize and *mmap_flags are set to 0.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 #ifdef MAP_HUGETLB
 
 	Size		default_hugepagesize = 0;
 	Size		hugepagesize_local = 0;
 	int			mmap_flags_local = 0;
+	int			memfd_flags_local = 0;
 
 	/*
 	 * System-dependent code to find out the default huge page size.
@@ -708,6 +710,7 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	}
 
 	mmap_flags_local = MAP_HUGETLB;
+	memfd_flags_local = MFD_HUGETLB;
 
 	/*
 	 * On recent enough Linux, also include the explicit page size, if
@@ -718,7 +721,16 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	{
 		int			shift = pg_ceil_log2_64(hugepagesize_local);
 
-		mmap_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+	}
+#endif
+
+#if defined(MFD_HUGE_MASK) && defined(MFD_HUGE_SHIFT)
+	if (hugepagesize_local != default_hugepagesize)
+	{
+		int			shift = pg_ceil_log2_64(hugepagesize_local);
+
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
 	}
 #endif
 
@@ -727,6 +739,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*mmap_flags = mmap_flags_local;
 	if (hugepagesize)
 		*hugepagesize = hugepagesize_local;
+	if (memfd_flags)
+		*memfd_flags = memfd_flags_local;
 
 #else
 
@@ -734,6 +748,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*hugepagesize = 0;
 	if (mmap_flags)
 		*mmap_flags = 0;
+	if (memfd_flags)
+		*memfd_flags = 0;
 
 #endif							/* MAP_HUGETLB */
 }
@@ -771,7 +787,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
-	int			mmap_flags = PG_MMAP_FLAGS;
+	int			mmap_flags = PG_MMAP_FLAGS, memfd_flags = 0;
 
 #ifndef MAP_HUGETLB
 	/* ReserveAnonymousMemory should have dealt with this case */
@@ -785,7 +801,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
 
 		/* Round up the request size to a suitable large value */
-		GetHugePageSize(&hugepagesize, &mmap_flags);
+		GetHugePageSize(&hugepagesize, &mmap_flags, &memfd_flags);
 
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
@@ -794,6 +810,29 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	}
 #endif
 
+	/*
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped,  it is automatically released.
+	 * references to the file are dropped, it is automatically released.
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
+	 */
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_segment),
+									   memfd_flags);
+
+	/*
+	 * Specify the segment file size using allocsize, which contains
+	 * potentially modified size.
+	 */
+	if(ftruncate(mapping->segment_fd, allocsize) == -1)
+		ereport(FATAL,
+				(errcode(ERRCODE_SYSTEM_ERROR),
+				 errmsg("could not truncate anonymous file for \"%s\": %m",
+						MappingName(mapping->shmem_segment))));
+
 	elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p",
 		 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
 
@@ -807,7 +846,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	 * a restart.
 	 */
 	ptr = mmap(base + reserved_offset, allocsize, PROT_READ | PROT_WRITE,
-			   mmap_flags | MAP_FIXED, -1, 0);
+			   mmap_flags | MAP_FIXED, mapping->segment_fd, 0);
 	mmap_errno = errno;
 
 	if (ptr == MAP_FAILED)
@@ -817,8 +856,15 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 					 "fallback to the non-resizable allocation",
 			 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
 
+		/* Specify the segment file size using allocsize. */
+		if(ftruncate(mapping->segment_fd, allocsize) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not truncate anonymous file for \"%s\": %m",
+							MappingName(mapping->shmem_segment))));
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-						   PG_MMAP_FLAGS, -1, 0);
+						   PG_MMAP_FLAGS, mapping->segment_fd, 0);
 		mmap_errno = errno;
 	}
 	else
@@ -889,7 +935,7 @@ ReserveAnonymousMemory(Size reserve_size)
 		Size		hugepagesize, total_size = 0;
 		int			mmap_flags;
 
-		GetHugePageSize(&hugepagesize, &mmap_flags);
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
 
 		/*
 		 * Figure out how much memory is needed for all segments, keeping in
@@ -1070,6 +1116,13 @@ AnonymousShmemResize(void)
 		if (m->shmem_size == new_size)
 			continue;
 
+		/* Resize the backing anon file. */
+		if(ftruncate(m->segment_fd, new_size) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not truncate anonymous file for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
 		/* Clean up some reserved space to resize into */
 		if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
 			ereport(FATAL,
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index ce719f1b412..ba972106de1 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -627,7 +627,7 @@ pgwin32_ReserveSharedMemoryRegion(HANDLE hChild)
  * use GetLargePageMinimum() instead.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 	if (hugepagesize)
 		*hugepagesize = 0;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index abeb91e24fd..dc2b4becf4a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -396,7 +396,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the number of huge pages required.
 	 */
-	GetHugePageSize(&hp_size, NULL);
+	GetHugePageSize(&hp_size, NULL, NULL);
 	if (hp_size != 0)
 	{
 		Size		hp_required;
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index ef9800732d9..40588ff6968 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 19ad2e2f788..192b637cc65 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -125,7 +125,8 @@ extern PGShmemHeader *PGSharedMemoryCreate(Size size,
 										   PGShmemHeader **shim, Pointer base);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
-extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
+extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
+							int *memfd_flags);
 void *ReserveAnonymousMemory(Size reserve_size);
 
 bool ProcessBarrierShmemResize(Barrier *barrier);
-- 
2.34.1

Attachment: 0008-Support-resize-for-hugetlb-20250610.patch (text/x-patch)
From 2ebc737cd5b22c4cb3fbcafb583c0bbd61fe93d0 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 5 Apr 2025 19:51:33 +0200
Subject: [PATCH 08/17] Support resize for hugetlb

The Linux kernel has a set of limitations on remapping hugetlb segments:
it can't increase the size of such a segment [1], and shrinking it will
not release the memory back. In fact, support for hugetlb mremap was
implemented only relatively recently [2].

As a workaround, avoid mremap for resizing shared memory. Instead, unmap
the whole segment and map it back at the same address with the new size,
relying on the fact that the fd for the anon file behind the segment is
still open and will keep the memory contents.

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mremap.c?id=f4d2ef48250ad057e4f00087967b5ff366da9f39#n1593
[2]: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/mremap.c?id=550a7d60bd5e35a56942dba6d8a26752beb26c9f
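
A minimal sketch of that unmap-and-remap workaround, assuming a segment that
is already backed by an anon file descriptor created elsewhere (illustration
only, not the patch's code):

    #include <sys/mman.h>
    #include <unistd.h>

    /*
     * Resize a file-backed segment without mremap: drop the mapping, resize
     * the backing anon file, and map it again at the same address so that
     * existing pointers stay valid. For hugetlb both sizes must be multiples
     * of the huge page size. Returns MAP_FAILED on error.
     */
    static void *
    resize_segment(void *addr, size_t old_size, size_t new_size, int fd)
    {
        /* The contents survive in the backing file while it stays open. */
        if (munmap(addr, old_size) == -1)
            return MAP_FAILED;

        if (ftruncate(fd, new_size) == -1)
            return MAP_FAILED;

        return mmap(addr, new_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    }
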
---
 src/backend/port/sysv_shmem.c | 60 +++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 87000a24eea..f0b53ce1d7c 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1109,6 +1109,7 @@ AnonymousShmemResize(void)
 		/* Note that CalculateShmemSize indirectly depends on NBuffers */
 		Size new_size = CalculateShmemSize(&numSemas, i);
 		AnonymousMapping *m = &Mappings[i];
+		int	mmap_flags = PG_MMAP_FLAGS;
 
 		if (m->shmem == NULL)
 			continue;
@@ -1116,6 +1117,44 @@ AnonymousShmemResize(void)
 		if (m->shmem_size == new_size)
 			continue;
 
+#ifndef MAP_HUGETLB
+		/* ReserveAnonymousMemory should have dealt with this case */
+		Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
+#else
+		if (huge_pages_on)
+		{
+			Size		hugepagesize;
+
+			/* Make sure nothing is messed up */
+			Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+			/* Round up the new size to a suitable large value */
+			GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+
+			if (new_size % hugepagesize != 0)
+				new_size += hugepagesize - (new_size % hugepagesize);
+
+			mmap_flags = PG_MMAP_FLAGS | mmap_flags;
+		}
+#endif
+
+		/*
+		 * Linux limitations do not allow us to mremap hugetlb in the way we
+		 * want. E.g. no size increase is allowed, and when shrinking, the memory
+		 * will not be released back. To work around this, unmap the segment and
+		 * create a new one at the same address. Thanks to the backing anon
+		 * file, the content will still be kept in memory.
+		 */
+		elog(DEBUG1, "segment[%s]: remap from %zu to %zu at address %p",
+					 MappingName(m->shmem_segment), m->shmem_size,
+					 new_size, m->shmem);
+
+		if (munmap(m->shmem, m->shmem_size) < 0)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not unmap shared memory segment %s [%p]: %m",
+							MappingName(m->shmem_segment), m->shmem)));
+
 		/* Resize the backing anon file. */
 		if(ftruncate(m->segment_fd, new_size) == -1)
 			ereport(FATAL,
@@ -1123,25 +1162,14 @@ AnonymousShmemResize(void)
 					 errmsg("could not truncate anonymous file for \"%s\": %m",
 							MappingName(m->shmem_segment))));
 
-		/* Clean up some reserved space to resize into */
-		if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
-			ereport(FATAL,
-					(errcode(ERRCODE_SYSTEM_ERROR),
-					 errmsg("could not unmap %zu from reserved shared memory %p: %m",
-							new_size - m->shmem_size, m->shmem)));
-
-		/* Claim the unused space */
-		elog(DEBUG1, "segment[%s]: remap from %zu to %zu at address %p",
-					 MappingName(m->shmem_segment), m->shmem_size,
-					 new_size, m->shmem);
-
-		ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
+		/* Reclaim the space */
+		ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
+				   mmap_flags | MAP_FIXED, m->segment_fd, 0);
 		if (ptr == MAP_FAILED)
 			ereport(FATAL,
 					(errcode(ERRCODE_SYSTEM_ERROR),
-					 errmsg("could not resize shared memory segment %s [%p] to %d (%zu): %m",
-							MappingName(m->shmem_segment), m->shmem, NBuffers,
-							new_size)));
+					 errmsg("could not map shared memory segment %s [%p] with size %zu: %m",
+							MappingName(m->shmem_segment), m->shmem, new_size)));
 
 		reinit = true;
 		m->shmem_size = new_size;
-- 
2.34.1

Attachment: 0006-Allow-to-resize-shared-memory-without-resta-20250610.patch (text/x-patch)
From 12a39ceb67b438c2fdf3727b51f5e4a9c105b3b0 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:47:16 +0200
Subject: [PATCH 06/17] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using the
space introduced in the previous commits, without requiring a PostgreSQL
restart. Essentially the implementation is based on two mechanisms: a
ProcSignalBarrier is used to make sure all processes start the resize
procedure simultaneously, and a global Barrier is used to coordinate
after that and make sure all processes that have finished wait for those
still in progress.

The resize process looks like this:

* The GUC assign hook sets a flag to let the Postmaster know that resize
  was requested.

* Postmaster verifies the flag in the event loop, and starts the resize
  by emitting a ProcSignal barrier.

* All processes that participate in the ProcSignal mechanism begin to
  process the ProcSignal barrier. First a process waits until all
  processes have confirmed they received the message, so that everyone
  can start simultaneously.

* Every process recalculates the shared memory size based on the new
  NBuffers and extends it using mremap. One elected process signals the
  postmaster to do the same.

* When finished, every process waits on a global ShmemControl barrier
  until all others are finished as well. This way we ensure three stages
  with clear boundaries: before the resize, when all processes use the
  old NBuffers; during the resize, when processes have a mix of old and
  new NBuffers and wait until it's done; after the resize, when all
  processes use the new NBuffers.

* After all processes are using the new value, one of them initializes the
  new shared structures (buffer blocks, descriptors, etc.) as needed and
  broadcasts the new value of NBuffers via ShmemControl in shared memory.
  Other backends wait for this operation to finish as well. Then the
  barrier is lifted and everything goes on as usual.

Since resizing takes time, we need to take into account that during that time:

- New backends can be spawned. They will check the status of the barrier
  early during bootstrap, and wait until everything is over to work
  with the new NBuffers value.

- Old backends can exit before attempting to resize. The synchronization
  used between backends relies on ProcSignalBarrier and at the beginning
  waits for all participants that received the message, gathering all
  existing backends.

- Some backends might be blocked and not responding either before or
  after receiving the message. In the first case such a backend still
  has a ProcSignalSlot and should be waited for, in the second case the
  shared barrier makes sure we are still waiting for those backends. In
  either case there is an unbounded wait.

- Backends might join the barrier in disjoint groups with some time in
  between. That means that relying only on the shared dynamic barrier is
  not enough -- it will only synchronize the resize procedure within
  those groups. That's why we first wait for all participants of the
  ProcSignal mechanism who received the message.

Here is how it looks after raising shared_buffers from 128 MB to
512 MB and calling pg_reload_conf():

    -- 128 MB
    7f90cde00000-7f90d4fa6000  /dev/zero (deleted)
    7f90d4fa6000-7f914de00000
    7f914de00000-7f915cfa8000  /dev/zero (deleted)
    ^ buffers mapping, ~241 MB
    7f915cfa8000-7f944de00000
    7f944de00000-7f94550a8000  /dev/zero (deleted)
    7f94550a8000-7f94cde00000
    7f94cde00000-7f94d4fe8000  /dev/zero (deleted)
    7f94d4fe8000-7f954de00000
    7f954de00000-7f9554ff6000  /dev/zero (deleted)
    7f9554ff6000-7f958de00000
    7f958de00000-7f959508a000  /dev/zero (deleted)
    7f959508a000-7f95cde00000

    -- 512 MB
    7f90cde00000-7f90d5126000  /dev/zero (deleted)
    7f90d5126000-7f914de00000
    7f914de00000-7f9175128000  /dev/zero (deleted)
    ^ buffers mapping, ~627 MB
    7f9175128000-7f944de00000
    7f944de00000-7f9455528000  /dev/zero (deleted)
    7f9455528000-7f94cde00000
    7f94cde00000-7f94d5228000  /dev/zero (deleted)
    7f94d5228000-7f954de00000
    7f954de00000-7f9555266000  /dev/zero (deleted)
    7f9555266000-7f958de00000
    7f958de00000-7f95954aa000  /dev/zero (deleted)
    7f95954aa000-7f95cde00000

The implementation supports only increasing shared_buffers. Decreasing
the value requires a similar procedure, but the buffer blocks with data
have to be drained first, so that the actual data set fits into the new,
smaller space.

From experiments it turns out that shared mappings have to be extended
separately for each process that uses them. Another rough edge is that a
backend blocked on ReadCommand will not apply the shared_buffers change
until it receives something.

Note that mremap is Linux specific, thus the implementation is not very
portable.
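
To illustrate the core mremap step in isolation, here is a small standalone
sketch (not from the patch) that reserves an address range up front, punches a
hole after the live mapping, and then grows the mapping in place. It uses
private anonymous memory purely to keep the demo self-contained:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        size_t reserved = 64UL * 1024 * 1024;
        size_t old_size = 16UL * 1024 * 1024;
        size_t new_size = 32UL * 1024 * 1024;
        char  *base;

        /* Reserve a large inaccessible range to grow into later. */
        base = mmap(NULL, reserved, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Place the actual segment at the start of the reservation. */
        if (mmap(base, old_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
            return 1;

        /* Unmap part of the reservation so there is room to expand into. */
        if (munmap(base + old_size, new_size - old_size) == -1)
            return 1;

        /* Grow in place; without MREMAP_MAYMOVE the address stays the same. */
        if (mremap(base, old_size, new_size, 0) == MAP_FAILED)
            return 1;

        base[new_size - 1] = 1;     /* the new tail is now usable */
        printf("grew mapping at %p from %zu to %zu bytes\n",
               (void *) base, old_size, new_size);
        return 0;
    }
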

Authors: Dmitrii Dolgov, Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 413 ++++++++++++++++++
 src/backend/postmaster/postmaster.c           |  18 +
 src/backend/storage/buffer/buf_init.c         |  75 ++--
 src/backend/storage/ipc/ipci.c                |  18 +-
 src/backend/storage/ipc/procsignal.c          |  46 ++
 src/backend/storage/ipc/shmem.c               |  23 +-
 src/backend/tcop/postgres.c                   |  10 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/backend/utils/misc/guc_tables.c           |   4 +-
 src/include/miscadmin.h                       |   1 +
 src/include/storage/bufmgr.h                  |   2 +-
 src/include/storage/ipc.h                     |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pg_shmem.h                |  26 ++
 src/include/storage/pmsignal.h                |   1 +
 src/include/storage/procsignal.h              |   1 +
 src/tools/pgindent/typedefs.list              |   1 +
 17 files changed, 603 insertions(+), 43 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index f46d9d5d9cd..a3437973784 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,13 +30,19 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/pmsignal.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/pidfile.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -105,6 +111,13 @@ typedef struct AnonymousMapping
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/* Flag telling postmaster that resize is needed */
+volatile bool pending_pm_shmem_resize = false;
+
+/* Keeps track of the previous NBuffers value */
+static int NBuffersOld = -1;
+static int NBuffersPending = -1;
+
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
@@ -176,6 +189,49 @@ static Size reserved_offset = 0;
  */
 static bool huge_pages_on = false;
 
+/*
+ * Flag telling that we have prepared the memory layout to be resizable. If
+ * false after all shared memory segments are created, it means we failed to
+ * set up the needed layout and fell back to the regular non-resizable approach.
+ */
+static bool shmem_resizable = false;
+
+/*
+ * Currently broadcasted value of NBuffers in shared memory.
+ *
+ * Most of the time this value is going to be equal to NBuffers. But if
+ * postmaster is resizing shared memory and a new backend was created
+ * at the same time, there is a possibility for the new backend to inherit the
+ * old NBuffers value, but miss the resize signal if ProcSignal infrastructure
+ * was not initialized yet. Consider this situation:
+ *
+ *     Postmaster ------> New Backend
+ *         |                   |
+ *         |                Launch
+ *         |                   |
+ *         |             Inherit NBuffers
+ *         |                   |
+ *     Resize NBuffers         |
+ *         |                   |
+ *     Emit Barrier            |
+ *         |            Init ProcSignal
+ *         |                   |
+ *     Finish resize           |
+ *         |                   |
+ *     New NBuffers       Old NBuffers
+ *
+ * In this case the backend is not yet ready to receive a signal from
+ * EmitProcSignalBarrier, and will be ignored. The same happens if ProcSignal
+ * is initialized even later, after the resizing was finished.
+ *
+ * To address resulting inconsistency, postmaster broadcasts the current
+ * NBuffers value via shared memory. Every new backend has to verify this value
+ * before it accesses the buffer pool: if it differs from its own value,
+ * this indicates a shared memory resize has happened and the backend has to
+ * first synchronize with the rest of the pack.
+ */
+ShmemControl *ShmemCtrl = NULL;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -769,6 +825,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	{
 		Size total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
 
+		shmem_resizable = true;
 		reserved_offset += total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
 	}
 
@@ -964,6 +1021,315 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * Resize all shared memory segments based on the current NBuffers value, which
+ * is applied from NBuffersPending. The actual segment resizing is done via
+ * mremap, which will fail if there is not enough space to expand the mapping.
+ * When finished, new buffer blocks, if any, are initialized based on the new
+ * and old values.
+ *
+ * If reinitializing took place, as the last step this function does buffers
+ * reinitialization as well and broadcasts the new value of NSharedBuffers. All
+ * of that needs to be done only by one backend, the first one that managed to
+ * grab the ShmemResizeLock.
+ */
+bool
+AnonymousShmemResize(void)
+{
+	int	numSemas;
+	bool reinit = false;
+	void *ptr = MAP_FAILED;
+	NBuffers = NBuffersPending;
+
+	elog(DEBUG1, "Resize shmem from %d to %d", NBuffersOld, NBuffers);
+
+	/*
+	 * XXX: Where to reset the flag is still an open question. E.g. do we
+	 * consider a no-op when NBuffers is equal to NBuffersOld a genuine resize
+	 * and reset the flag?
+	 */
+	pending_pm_shmem_resize = false;
+
+	/*
+	 * XXX: Currently only increasing shared_buffers is supported. For
+	 * decreasing, something similar has to be done, but buffer blocks with
+	 * data have to be drained first.
+	 */
+	if(NBuffersOld > NBuffers)
+		return false;
+
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		/* Note that CalculateShmemSize indirectly depends on NBuffers */
+		Size new_size = CalculateShmemSize(&numSemas, i);
+		AnonymousMapping *m = &Mappings[i];
+
+		if (m->shmem == NULL)
+			continue;
+
+		if (m->shmem_size == new_size)
+			continue;
+
+		/* Clean up some reserved space to resize into */
+		if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not unmap %zu from reserved shared memory %p: %m",
+							new_size - m->shmem_size, m->shmem)));
+
+		/* Claim the unused space */
+		elog(DEBUG1, "segment[%s]: remap from %zu to %zu at address %p",
+					 MappingName(m->shmem_segment), m->shmem_size,
+					 new_size, m->shmem);
+
+		ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
+		if (ptr == MAP_FAILED)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not resize shared memory segment %s [%p] to %d (%zu): %m",
+							MappingName(m->shmem_segment), m->shmem, NBuffers,
+							new_size)));
+
+		reinit = true;
+		m->shmem_size = new_size;
+	}
+
+	if (reinit)
+	{
+		if(IsUnderPostmaster &&
+			LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+		{
+			/*
+			 * If the new NBuffers was already broadcasted, the buffer pool was
+			 * already initialized before.
+			 *
+			 * Since we're not on a hot path, we use lwlocks and do not need to
+			 * involve memory barrier.
+			 */
+			if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+			{
+				/*
+				 * Allow the first backend that managed to get the lock to
+				 * reinitialize the new portion of buffer pool. Every other
+				 * process will wait on the shared barrier for that to finish,
+				 * since it's a part of the SHMEM_RESIZE_DONE phase.
+				 *
+				 * Note that it's enough when only one backend will do that,
+				 * even the ShmemInitStruct part. The reason is that resized
+				 * shared memory will maintain the same addresses, meaning that
+				 * all the pointers are still valid, and we only need to update
+				 * structures size in the ShmemIndex once -- any other backend
+				 * will pick up this shared structure from the index.
+				 *
+				 * XXX: This is the right place for buffer eviction as well.
+				 */
+				BufferManagerShmemInit(NBuffersOld);
+
+				/* If all fine, broadcast the new value */
+				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+			}
+
+			LWLockRelease(ShmemResizeLock);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * We are asked to resize shared memory. Wait for all ProcSignal participants
+ * to join the barrier, then do the resize and wait on the barrier until all
+ * participating finish resizing as well -- otherwise we face danger of
+ * inconsistency between backends.
+ *
+ * XXX: If a backend is blocked on ReadCommand in PostgresMain, it will not
+ * proceed with AnonymousShmemResize after receiving SIGHUP, until something
+ * is sent.
+ */
+bool
+ProcessBarrierShmemResize(Barrier *barrier)
+{
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+
+	/* Wait until we have seen the new NBuffers value */
+	if (!pending_pm_shmem_resize)
+		return false;
+
+	/*
+	 * First thing to do after attaching to the barrier is to wait for others.
+	 * We can't simply use BarrierArriveAndWait, because backends might arrive
+	 * here in disjoint groups, e.g. first two backends, pause, then second two
+	 * backends. If the resize is quick enough, that can lead to a situation
+	 * where the first group is already finished before the second has appeared,
+	 * and the barrier will only synchronize within those groups.
+	 */
+	if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
+		WaitForProcSignalBarrierReceived(
+				pg_atomic_read_u64(&ShmemCtrl->Generation));
+
+	/*
+	 * Now start the procedure, and elect one backend to ping postmaster to do
+	 * the same.
+	 *
+	 * XXX: If we need to be able to abort resizing, this has to be done later,
+	 * after the SHMEM_RESIZE_DONE.
+	 */
+	if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+	{
+		Assert(IsUnderPostmaster);
+		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
+	}
+
+	AnonymousShmemResize();
+
+	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
+	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
+
+	BarrierDetach(barrier);
+	return true;
+}
+
+/*
+ * GUC assign hook for shared_buffers. It's recommended for an assign hook to
+ * be as minimal as possible, thus we just request shared memory resize and
+ * remember the previous value.
+ */
+void
+assign_shared_buffers(int newval, void *extra, bool *pending)
+{
+	elog(DEBUG1, "Received SIGHUP for shmem resizing");
+
+	/* Request shared memory resize only when it was initialized */
+	if (next_free_segment != 0)
+	{
+		elog(DEBUG1, "Set pending signal");
+		pending_pm_shmem_resize = true;
+		*pending = true;
+		NBuffersPending = newval;
+	}
+
+	NBuffersOld = NBuffers;
+}
+
+/*
+ * Test if we have somehow missed a shmem resize signal and NBuffers value
+ * differs from NSharedBuffers. If yes, catchup and do resize.
+ */
+void
+AdjustShmemSize(void)
+{
+	uint32 NSharedBuffers = pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);
+
+	if (NSharedBuffers != NBuffers)
+	{
+		/*
+		 * If the broadcasted shared_buffers is different from the one we see,
+		 * it could be that the backend has missed a resize signal. To avoid
+		 * any inconsistency, adjust the shared mappings, before having a
+		 * chance to access the buffer pool.
+		 */
+		ereport(LOG,
+				(errmsg("shared_buffers has been changed from %d to %d, "
+						"resize shared memory",
+						NBuffers, NSharedBuffers)));
+		NBuffers = NSharedBuffers;
+		AnonymousShmemResize();
+	}
+}
+
+/*
+ * Start resizing procedure, making sure all existing processes will have
+ * consistent view of shared memory size. Must be called only in postmaster.
+ */
+void
+CoordinateShmemResize(void)
+{
+	elog(DEBUG1, "Coordinating shmem resize from %d to %d",
+		 NBuffersOld, NBuffers);
+	Assert(!IsUnderPostmaster);
+
+	/*
+	 * We use dynamic barrier to help dealing with backends that were spawned
+	 * during the resize.
+	 */
+	BarrierInit(&ShmemCtrl->Barrier, 0);
+
+	/*
+	 * If the value did not change, or shared memory segments are not
+	 * initialized yet, skip the resize.
+	 */
+	if (NBuffersPending == NBuffersOld || next_free_segment == 0)
+	{
+		elog(DEBUG1, "Skip resizing, new %d, old %d, free segment %d",
+			 NBuffers, NBuffersOld, next_free_segment);
+		return;
+	}
+
+	/*
+	 * Shared memory resize requires some coordination done by postmaster,
+	 * and consists of three phases:
+	 *
+	 * - Before the resize all existing backends have the same old NBuffers.
+	 * - When resize is in progress, backends are expected to have a
+	 *   mixture of old and new values. They're not allowed to touch the
+	 *   buffer pool during this time frame.
+	 * - After resize has been finished, all existing backends, that can access
+	 *   the buffer pool, are expected to have the same new value of NBuffers.
+	 *
+	 * Those phases are ensured by joining the shared barrier associated with
+	 * the procedure. Since resizing takes time, we need to take into account
+	 * that during that time:
+	 *
+	 * - New backends can be spawned. They will check the status of the barrier
+	 *   early during bootstrap, and wait until everything is over to work
+	 *   with the new NBuffers value.
+	 *
+	 * - Old backends can exit before attempting to resize. The synchronization
+	 *   used between backends relies on ProcSignalBarrier and at the beginning
+	 *   waits for all participants that received the message, gathering all
+	 *   existing backends.
+	 *
+	 * - Some backends might be blocked and not responding either before or
+	 *   after receiving the message. In the first case such a backend still
+	 *   has a ProcSignalSlot and should be waited for, in the second case the
+	 *   shared barrier makes sure we are still waiting for those backends. In
+	 *   either case there is an unbounded wait.
+	 *
+	 * - Backends might join the barrier in disjoint groups with some time in
+	 *   between. That means that relying only on the shared dynamic barrier is
+	 *   not enough -- it will only synchronize the resize procedure within
+	 *   those groups. That's why we first wait for all participants of the
+	 *   ProcSignal mechanism who received the message.
+	 */
+	elog(DEBUG1, "Emit a barrier for shmem resizing");
+	pg_atomic_init_u64(&ShmemCtrl->Generation,
+					   EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE));
+
+	/* To order everything after setting Generation value */
+	pg_memory_barrier();
+
+	/*
+	 * After that postmaster waits for PMSIGNAL_SHMEM_RESIZE as a sign that all
+	 * the rest of the pack has started the procedure and it can resize shared
+	 * memory as well.
+	 *
+	 * Normally we would call WaitForProcSignalBarrier here to wait until every
+	 * backend has reported on the ProcSignalBarrier. But for shared memory
+	 * resize we don't need this, as every participating backend will
+	 * synchronize on the ProcSignal barrier. In fact even if we would like to
+	 * wait here, it wouldn't be possible -- we're in the postmaster, without
+	 * any waiting infrastructure available.
+	 *
+	 * If at some point it will turn out that waiting is essential, we would
+	 * need to consider some alternatives. E.g. it could be a designated
+	 * coordination process, which is not a postmaster. Another option would be
+	 * to introduce a CoordinateShmemResize lock and allow only one process to
+	 * take it (this probably would have to be something different than
+	 * LWLocks, since they block interrupts, and coordination relies on them).
+	 */
+}
+
 /*
  * PGSharedMemoryCreate
  *
@@ -1271,3 +1637,50 @@ PGSharedMemoryDetach(void)
 		}
 	}
 }
+
+void
+WaitOnShmemBarrier(void)
+{
+	Barrier *barrier = &ShmemCtrl->Barrier;
+
+	/* Nothing to do if resizing is not started */
+	if (BarrierPhase(barrier) < SHMEM_RESIZE_START)
+		return;
+
+	BarrierAttach(barrier);
+
+	/* Otherwise wait through all available phases */
+	while (BarrierPhase(barrier) < SHMEM_RESIZE_DONE)
+	{
+		ereport(LOG, (errmsg("ProcSignal barrier is in phase %d, waiting",
+							 BarrierPhase(barrier))));
+
+		BarrierArriveAndWait(barrier, 0);
+	}
+
+	BarrierDetach(barrier);
+}
+
+void
+ShmemControlInit(void)
+{
+	bool foundShmemCtrl;
+
+	ShmemCtrl = (ShmemControl *)
+	ShmemInitStruct("Shmem Control", sizeof(ShmemControl),
+									 &foundShmemCtrl);
+
+	if (!foundShmemCtrl)
+	{
+		/*
+		 * The barrier is missing here, it will be initialized right before
+		 * starting the resizing process as a convenient way to reset it.
+		 */
+
+		/* Initialize with the currently known value */
+		pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+
+		/* shmem_resizable should be initialized by now */
+		ShmemCtrl->Resizable = shmem_resizable;
+	}
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 490f7ce3664..f0cb0098dcd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -426,6 +426,7 @@ static void process_pm_pmsignal(void);
 static void process_pm_child_exit(void);
 static void process_pm_reload_request(void);
 static void process_pm_shutdown_request(void);
+static void process_pm_shmem_resize(void);
 static void dummy_handler(SIGNAL_ARGS);
 static void CleanupBackend(PMChild *bp, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
@@ -1694,6 +1695,9 @@ ServerLoop(void)
 			if (pending_pm_pmsignal)
 				process_pm_pmsignal();
 
+			if (pending_pm_shmem_resize)
+				process_pm_shmem_resize();
+
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
 				ClientSocket s;
@@ -2039,6 +2043,17 @@ process_pm_reload_request(void)
 	}
 }
 
+static void
+process_pm_shmem_resize(void)
+{
+	/*
+	 * Failure to resize is considered to be fatal and will not be
+	 * retried, which means we can disable pending flag right here.
+	 */
+	pending_pm_shmem_resize = false;
+	CoordinateShmemResize();
+}
+
 /*
  * pg_ctl uses SIGTERM, SIGINT and SIGQUIT to request different types of
  * shutdown.
@@ -3852,6 +3867,9 @@ process_pm_pmsignal(void)
 		request_state_update = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_SHMEM_RESIZE))
+		AnonymousShmemResize();
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
 	 */
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index bd68b69ee98..ac844b114bd 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -17,6 +17,7 @@
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -24,7 +25,6 @@ ConditionVariableMinimallyPadded *BufferIOCVArray;
 WritebackContext BackendWritebackContext;
 CkptSortItem *CkptBufferIds;
 
-
 /*
  * Data Structures:
  *		buffers live in a freelist and a lookup data structure.
@@ -62,18 +62,28 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend). Size of data structures initialized
- * here depends on NBuffers, and to be able to change NBuffers without a
- * restart we store each structure into a separate shared memory segment, which
- * could be resized on demand.
+ * postmaster, or in a standalone backend) or during shared-memory resize. Size
+ * of data structures initialized here depends on NBuffers, and to be able to
+ * change NBuffers without a restart we store each structure into a separate
+ * shared memory segment, which could be resized on demand.
+ *
+ * FirstBufferToInit tells where to start initializing buffers. For
+ * initialization it will always be zero, but when resizing shared memory it
+ * indicates the number of already initialized buffers.
+ *
+ * No locks are taken in this function; it is the caller's responsibility to
+ * make sure only one backend can work with the new buffers.
  */
 void
-BufferManagerShmemInit(void)
+BufferManagerShmemInit(int FirstBufferToInit)
 {
 	bool		foundBufs,
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
+	int			i;
+	elog(DEBUG1, "BufferManagerShmemInit from %d to %d",
+				 FirstBufferToInit, NBuffers);
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
@@ -110,43 +120,44 @@ BufferManagerShmemInit(void)
 	{
 		/* should find all of these, or none of them */
 		Assert(foundDescs && foundBufs && foundIOCV && foundBufCkpt);
-		/* note: this path is only taken in EXEC_BACKEND case */
-	}
-	else
-	{
-		int			i;
-
 		/*
-		 * Initialize all the buffer headers.
+		 * note: this path is only taken in EXEC_BACKEND case when initializing
+		 * shared memory, or in all cases when resizing shared memory.
 		 */
-		for (i = 0; i < NBuffers; i++)
-		{
-			BufferDesc *buf = GetBufferDescriptor(i);
+	}
 
-			ClearBufferTag(&buf->tag);
+#ifndef EXEC_BACKEND
+	/*
+	 * Initialize all the buffer headers.
+	 */
+	for (i = FirstBufferToInit; i < NBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
 
-			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+		ClearBufferTag(&buf->tag);
 
-			buf->buf_id = i;
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
 
-			pgaio_wref_clear(&buf->io_wref);
+		buf->buf_id = i;
 
-			/*
-			 * Initially link all the buffers together as unused. Subsequent
-			 * management of this list is done by freelist.c.
-			 */
-			buf->freeNext = i + 1;
+		pgaio_wref_clear(&buf->io_wref);
 
-			LWLockInitialize(BufferDescriptorGetContentLock(buf),
-							 LWTRANCHE_BUFFER_CONTENT);
+		/*
+		 * Initially link all the buffers together as unused. Subsequent
+		 * management of this list is done by freelist.c.
+		 */
+		buf->freeNext = i + 1;
 
-			ConditionVariableInit(BufferDescriptorGetIOCV(buf));
-		}
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
 
-		/* Correct last entry of linked list */
-		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
 	}
+#endif
+
+	/* Correct last entry of linked list */
+	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 9d00b80b4f8..abeb91e24fd 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -84,6 +84,9 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ *
+ * XXX: Calculations for non-main shared memory segments are incorrect; they
+ * include more than what is needed for buffers only.
  */
 Size
 CalculateShmemSize(int *num_semaphores, int shmem_segment)
@@ -151,6 +154,14 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how slots are split, or was there a hidden
+	 * inconsistency in the shmem calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
@@ -298,7 +309,7 @@ CreateOrAttachShmemStructs(void)
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
-	BufferManagerShmemInit();
+	BufferManagerShmemInit(0);
 
 	/*
 	 * Set up lock manager
@@ -310,6 +321,11 @@ CreateOrAttachShmemStructs(void)
 	 */
 	PredicateLockShmemInit();
 
+	/*
+	 * Set up shared memory resize manager
+	 */
+	ShmemControlInit();
+
 	/*
 	 * Set up process table
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6bec9be423..d7b56a18b24 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/pg_shmem.h"
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
@@ -113,6 +114,10 @@ static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
 
+#ifdef DEBUG_SHMEM_RESIZE
+bool delay_proc_signal_init = false;
+#endif
+
 /*
  * ProcSignalShmemSize
  *		Compute space needed for ProcSignal's shared memory
@@ -176,6 +181,43 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	uint32		old_pss_pid;
 
 	Assert(cancel_key_len >= 0 && cancel_key_len <= MAX_CANCEL_KEY_LENGTH);
+
+#ifdef DEBUG_SHMEM_RESIZE
+	/*
+	 * Introduced for debugging purposes. You can change the variable at
+	 * runtime using gdb, then start new backends with delayed ProcSignal
+	 * initialization. A simple pg_usleep won't work here due to the SIGHUP
+	 * interrupt needed for testing. Taken from pg_sleep.
+	 */
+	if (delay_proc_signal_init)
+	{
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+		float8		endtime = GetNowFloat() + 5;
+
+		for (;;)
+		{
+			float8		delay;
+			long		delay_ms;
+
+			CHECK_FOR_INTERRUPTS();
+
+			delay = endtime - GetNowFloat();
+			if (delay >= 600.0)
+				delay_ms = 600000;
+			else if (delay > 0.0)
+				delay_ms = (long) (delay * 1000.0);
+			else
+				break;
+
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 delay_ms,
+							 WAIT_EVENT_PG_SLEEP);
+			ResetLatch(MyLatch);
+		}
+	}
+#endif
+
 	if (MyProcNumber < 0)
 		elog(ERROR, "MyProcNumber not set");
 	if (MyProcNumber >= NumProcSignalSlots)
@@ -615,6 +657,10 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
+					case PROCSIGNAL_BARRIER_SHMEM_RESIZE:
+						processed = ProcessBarrierShmemResize(
+								&ShmemCtrl->Barrier);
+						break;
 				}
 
 				/*
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 7e1a9b43fae..c07572d6f89 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -498,17 +498,26 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. Verify the structure's size:
+		 * - If it's the same, we've found the expected structure.
+		 * - If it's different, we're resizing the expected structure.
+		 *
+		 * XXX: There is an implicit assumption this can only happen in
+		 * "resizable" segments, where only one shared structure is allowed.
+		 * This has to be implemented more cleanly.
 		 */
 		if (result->size != size)
 		{
-			LWLockRelease(ShmemIndexLock);
-			ereport(ERROR,
-					(errmsg("ShmemIndex entry size is wrong for data structure"
-							" \"%s\": expected %zu, actual %zu",
-							name, size, result->size)));
+			Size delta = size - result->size;
+
+			result->size = size;
+
+			/* Reflect size change in the shared segment */
+			SpinLockAcquire(Segments[shmem_segment].ShmemLock);
+			Segments[shmem_segment].ShmemSegHdr->freeoffset += delta;
+			SpinLockRelease(Segments[shmem_segment].ShmemLock);
 		}
+
 		structPtr = result->location;
 	}
 	else
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0d1b6466d1e..0942d2bffe2 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -62,6 +62,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -4309,6 +4310,15 @@ PostgresMain(const char *dbname, const char *username)
 	 */
 	BeginReportingGUCOptions();
 
+	/* Check the shared barrier; if it's still active, join and wait. */
+	WaitOnShmemBarrier();
+
+	/*
+	 * After waiting on the barrier above we are guaranteed to have
+	 * NSharedBuffers broadcast, so we can use it in the function below.
+	 */
+	AdjustShmemSize();
+
 	/*
 	 * Also set up handler to log session end; we have to wait till now to be
 	 * sure Log_disconnections has its final value.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4da68312b5f..691fa14e9e3 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -155,6 +155,8 @@ REPLICATION_ORIGIN_DROP	"Waiting for a replication origin to become inactive so
 REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
+SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
@@ -352,6 +354,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e63521e5a2d..9f00608f508 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2366,14 +2366,14 @@ struct config_int ConfigureNamesInt[] =
 	 * checking for overflow, so we mustn't allow more than INT_MAX / 2.
 	 */
 	{
-		{"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+		{"shared_buffers", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the number of shared memory buffers used by the server."),
 			NULL,
 			GUC_UNIT_BLOCKS
 		},
 		&NBuffers,
 		16384, 16, INT_MAX / 2,
-		NULL, NULL, NULL
+		NULL, assign_shared_buffers, NULL
 	},
 
 	{
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..a0c37a7749e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -173,6 +173,7 @@ extern PGDLLIMPORT char *DataDir;
 extern PGDLLIMPORT int data_directory_mode;
 
 extern PGDLLIMPORT int NBuffers;
+extern PGDLLIMPORT int MaxAvailableMemory;
 extern PGDLLIMPORT int MaxBackends;
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index edac9db6a12..4239ebe640b 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -317,7 +317,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_skipped);
 
 /* in buf_init.c */
-extern void BufferManagerShmemInit(void);
+extern void BufferManagerShmemInit(int);
 extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 6ebda479ced..bb7ae4d33b3 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -64,6 +64,7 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
+extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
@@ -83,5 +84,7 @@ extern void CreateSharedMemoryAndSemaphores(void);
 extern void AttachSharedMemoryStructs(void);
 #endif
 extern void InitializeShmemGUCs(void);
+extern void CoordinateShmemResize(void);
+extern bool AnonymousShmemResize(void);
 
 #endif							/* IPC_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..558da6fdd55 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, ShmemResize)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index f8459a5a421..19ad2e2f788 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,6 +24,7 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "storage/barrier.h"
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
@@ -56,6 +57,25 @@ typedef struct ShmemSegment
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
+/*
+ * ShmemControl is shared between backends and helps to coordinate shared
+ * memory resize.
+ */
+typedef struct
+{
+	pg_atomic_uint32 	NSharedBuffers;
+	Barrier 			Barrier;
+	pg_atomic_uint64 	Generation;
+	bool                Resizable;
+} ShmemControl;
+
+extern PGDLLIMPORT ShmemControl *ShmemCtrl;
+
+/* The phases for shared memory resizing, used for the ProcSignal barrier. */
+#define SHMEM_RESIZE_REQUESTED			0
+#define SHMEM_RESIZE_START				1
+#define SHMEM_RESIZE_DONE				2
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -108,6 +128,12 @@ extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 void *ReserveAnonymousMemory(Size reserve_size);
 
+extern bool ProcessBarrierShmemResize(Barrier *barrier);
+extern void assign_shared_buffers(int newval, void *extra, bool *pending);
+extern void AdjustShmemSize(void);
+extern void WaitOnShmemBarrier(void);
+extern void ShmemControlInit(void);
+
 /*
  * To be able to dynamically resize largest parts of the data stored in shared
  * memory, we split it into multiple shared memory mappings segments. Each
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 428aa3fd68a..1a55bf57a70 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -42,6 +42,7 @@ typedef enum
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
 	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
+	PMSIGNAL_SHMEM_RESIZE,	/* resize shared memory */
 } PMSignalReason;
 
 #define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 2733bbb8c5b..97033f84dce 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,6 +54,7 @@ typedef enum
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
+	PROCSIGNAL_BARRIER_SHMEM_RESIZE, /* ask backends to resize shared memory */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a8346cda633..b026a275c38 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2745,6 +2745,7 @@ ShellTypeInfo
 ShippableCacheEntry
 ShippableCacheKey
 ShmemIndexEnt
+ShmemControl
 ShutdownForeignScan_function
 ShutdownInformation
 ShutdownMode
-- 
2.34.1

0009-Support-shrinking-shared-buffers-20250610.patch
From 44dd06152fd9b8f65f80c81420974cb77e12e237 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 9 Jun 2025 14:40:34 +0530
Subject: [PATCH 09/17] Support shrinking shared buffers

When shrinking the shared buffers pool, each buffer in the area being
shrunk needs to be flushed if it's dirty so as not to lose the changes
to that buffer after shrinking. Also, each such buffer needs to be
removed from the buffer mapping table so that backends do not access it
after shrinking.

Buffer eviction requires a separate barrier phase for two reasons:

1. No other backend should map a new page to any of the buffers being
   evicted when eviction is in progress. So they wait while eviction is
   in progress.

2. Since a pinned buffer has the pin recorded in the backend local
   memory as well as the buffer descriptor (which is in shared memory),
   eviction should not coincide with remapping the shared memory of a
   backend. Otherwise we might loose consistency of local and shared
   pinning records. Hence it needs to be carried out in
   ProcessBarrierShmemResize() and not in AnonymousShmemResize() as
   indicated by now removed comment.

If a buffer being evicted is pinned, we raise a FATAL error, but this should
be improved. There are multiple options: 1. wait for the pinned buffer to get
unpinned, 2. kill the backend or have it cancel its own query, or 3. roll
back the operation. Note that options 1 and 2 would require accessing the
pinning-related local and shared records, and we lack the infrastructure to
do either of those right now.

Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 30 ++++---
 src/backend/storage/buffer/buf_init.c         |  8 +-
 src/backend/storage/buffer/bufmgr.c           | 89 +++++++++++++++++++
 src/backend/storage/buffer/freelist.c         | 68 ++++++++++++++
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/storage/buf_internals.h           |  1 +
 src/include/storage/bufmgr.h                  |  1 +
 src/include/storage/pg_shmem.h                |  1 +
 8 files changed, 187 insertions(+), 12 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index f0b53ce1d7c..03aa6a41828 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1096,14 +1096,6 @@ AnonymousShmemResize(void)
 	 */
 	pending_pm_shmem_resize = false;
 
-	/*
-	 * XXX: Currently only increasing of shared_buffers is supported. For
-	 * decreasing something similar has to be done, but buffer blocks with
-	 * data have to be drained first.
-	 */
-	if(NBuffersOld > NBuffers)
-		return false;
-
 	for(int i = 0; i < next_free_segment; i++)
 	{
 		/* Note that CalculateShmemSize indirectly depends on NBuffers */
@@ -1201,8 +1193,6 @@ AnonymousShmemResize(void)
 				 * all the pointers are still valid, and we only need to update
 				 * structures size in the ShmemIndex once -- any other backend
 				 * will pick up this shared structure from the index.
-				 *
-				 * XXX: This is the right place for buffer eviction as well.
 				 */
 				BufferManagerShmemInit(NBuffersOld);
 
@@ -1262,6 +1252,25 @@ ProcessBarrierShmemResize(Barrier *barrier)
 		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
 	}
 
+	/*
+	 * Evict extra buffers when shrinking shared buffers. We need to do this
+	 * while the memory for extra buffers is still mapped, i.e. before remapping
+	 * the shared memory segments to a smaller memory area.
+	 */
+	if (NBuffersOld > NBuffersPending)
+	{
+		/*
+		 * TODO: If the buffer eviction fails for any reason, we should
+		 * gracefully roll back the shared buffer resizing and try again. But the
+		 * infrastructure to do so is not available right now. Hence just raise
+		 * a FATAL so that the system restarts.
+		 */
+		if (!EvictExtraBuffers(NBuffersPending, NBuffersOld))
+			elog(FATAL, "buffer eviction failed");
+
+		BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_EVICT);
+	}
+
 	AnonymousShmemResize();
 
 	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
@@ -1763,5 +1772,6 @@ ShmemControlInit(void)
 
 		/* shmem_resizable should be initialized by now */
 		ShmemCtrl->Resizable = shmem_resizable;
+		ShmemCtrl->evictor_pid = 0;
 	}
 }
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ac844b114bd..f78be4700df 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -156,8 +156,12 @@ BufferManagerShmemInit(int FirstBufferToInit)
 	}
 #endif
 
-	/* Correct last entry of linked list */
-	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+	/*
+	 * Correct last entry of linked list, when initializing the buffers or when
+	 * expanding the buffers.
+	 */
+	if (FirstBufferToInit < NBuffers)
+		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 667aa0c0c78..57d78c482bb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/read_stream.h"
 #include "storage/smgr.h"
@@ -7453,3 +7454,91 @@ const PgAioHandleCallbacks aio_local_buffer_readv_cb = {
 	.complete_local = local_buffer_readv_complete,
 	.report = buffer_readv_report,
 };
+
+/*
+ * When shrinking shared buffers pool, evict the buffers which will not be part
+ * of the shrunk buffer pool.
+ */
+bool
+EvictExtraBuffers(int newBufSize, int oldBufSize)
+{
+	bool result = true;
+
+	/*
+	 * If the buffer being evicted is locked, this function will need to wait.
+	 * It should not be called from the postmaster, which cannot wait on a lock.
+	 */
+	Assert(IsUnderPostmaster);
+
+	/*
+	 * Let only one backend perform eviction. We could split the work across all
+	 * the backends but that doesn't seem necessary.
+	 *
+	 * The first backend to acquire ShmemResizeLock sets its own PID as the
+	 * evictor PID for other backends to know that the eviction is in progress or
+	 * has already been performed. The evictor backend releases the lock when it
+	 * finishes eviction.  While the eviction is in progress, backends other than
+	 * evictor backend won't be able to take the lock. They won't perform
+	 * eviction. A backend may acquire the lock after eviction has completed, but
+	 * it will not perform eviction since the evictor PID is already set. Evictor
+	 * PID is reset only when the buffer resizing finishes. Thus only one backend
+	 * will perform eviction in a given instance of shared buffers resizing.
+	 *
+	 * Any backend which acquires this lock will release it before the eviction
+	 * phase finishes, hence the same lock can be reused for the next phase of
+	 * resizing buffers.
+	 */
+	if (LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+	{
+		if (ShmemCtrl->evictor_pid == 0)
+		{
+			ShmemCtrl->evictor_pid = MyProcPid;
+
+			StrategyPurgeFreeList(newBufSize);
+
+			/*
+			 * TODO: Before evicting any buffer, we should check whether any of the
+			 * buffers are pinned. If we find that a buffer is pinned after evicting
+			 * most of them, that will impact performance since all those evicted
+			 * buffers might need to be read again.
+			 */
+			for (Buffer buf = newBufSize + 1; buf <= oldBufSize; buf++)
+			{
+				BufferDesc *desc = GetBufferDescriptor(buf - 1);
+				uint32		buf_state;
+				bool		buffer_flushed;
+
+				buf_state = pg_atomic_read_u32(&desc->state);
+
+				/*
+				 * Nobody is expected to touch the buffers while resizing is
+				 * going on, hence an unlocked precheck should be safe and saves
+				 * some cycles.
+				 */
+				if (!(buf_state & BM_VALID))
+					continue;
+
+				ResourceOwnerEnlarge(CurrentResourceOwner);
+				ReservePrivateRefCountEntry();
+
+				LockBufHdr(desc);
+
+				/*
+				 * Now that we have locked buffer descriptor, make sure that the
+				 * buffer without valid data has been skipped above.
+				 */
+				Assert(buf_state & BM_VALID);
+
+				if (!EvictUnpinnedBufferInternal(desc, &buffer_flushed))
+				{
+					elog(WARNING, "could not remove buffer %u, it is pinned", buf);
+					result = false;
+					break;
+				}
+			}
+		}
+		LWLockRelease(ShmemResizeLock);
+	}
+
+	return result;
+}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index bd390f2709d..e384e46c779 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -527,6 +527,74 @@ StrategyInitialize(bool init)
 }
 
 
+/*
+ * StrategyPurgeFreeList -- remove all buffers with id higher than the number of
+ * buffers in the buffer pool.
+ *
+ * This is called before evicting buffers while shrinking shared buffers, so that
+ * the free list does not reference a buffer that will be removed.
+ * 
+ * The function is called after resizing has started and thus nobody should be
+ * traversing the free list and also not touching the buffers.
+ */
+void
+StrategyPurgeFreeList(int numBuffers)
+{
+	int firstBuffer = FREENEXT_END_OF_LIST;
+    int nextFree = StrategyControl->firstFreeBuffer;
+	BufferDesc *prevValidBuf = NULL;
+
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+    while (nextFree != FREENEXT_END_OF_LIST)
+    {
+ 		BufferDesc *buf = GetBufferDescriptor(nextFree);
+
+		/* nextFree should be id of buffer being examined. */
+		Assert(nextFree == buf->buf_id);
+		/* The buffer should not be marked as not in the list. */
+		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+		/*
+		 * If the buffer is within the new size of pool, keep it in the free list
+		 * otherwise discard it.
+		 */
+        if (buf->buf_id < numBuffers)
+        {
+			if (prevValidBuf != NULL)
+				prevValidBuf->freeNext = buf->buf_id;
+			prevValidBuf = buf;
+
+			/* Save the first free buffer in the list if not already known. */
+			if (firstBuffer == FREENEXT_END_OF_LIST)
+				firstBuffer = nextFree;
+        }
+		/* Examine the next buffer in the free list. */
+		nextFree = buf->freeNext;
+    }
+
+	/* Update the last valid free buffer, if there's any. */
+	if (prevValidBuf != NULL)
+	{
+		StrategyControl->lastFreeBuffer = prevValidBuf->buf_id;
+		prevValidBuf->freeNext = FREENEXT_END_OF_LIST;
+	}
+	else
+		StrategyControl->lastFreeBuffer = FREENEXT_END_OF_LIST;
+
+	/* Update first valid free buffer, if there's any. */
+	StrategyControl->firstFreeBuffer = firstBuffer;
+
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
+	/*
+	 * TODO: the following was suggested by AI; check whether it is required:
+	 * If we removed all buffers from the freelist, reset the clock sweep
+	 * pointer to zero.  This is not strictly necessary, but it seems like a
+	 * good idea to avoid confusion.
+	 */
+}
+
 /* ----------------------------------------------------------------
  *				Backend-private buffer ring management
  * ----------------------------------------------------------------
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 691fa14e9e3..0c588b69a90 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -156,6 +156,7 @@ REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it c
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
 SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_EVICT	"Waiting for other backends to finish the buffer eviction phase."
 SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0dec7d93b3b..add15e3723b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -453,6 +453,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
+extern void StrategyPurgeFreeList(int numBuffers);
 extern bool have_free_buffer(void);
 
 /* buf_table.c */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4239ebe640b..0c554f0b130 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -315,6 +315,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_evicted,
 									int32 *buffers_flushed,
 									int32 *buffers_skipped);
+extern bool EvictExtraBuffers(int fromBuf, int toBuf);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(int);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 192b637cc65..23998f5469d 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -64,6 +64,7 @@ extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 typedef struct
 {
 	pg_atomic_uint32 	NSharedBuffers;
+	pid_t				evictor_pid;
 	Barrier 			Barrier;
 	pg_atomic_uint64 	Generation;
 	bool                Resizable;
-- 
2.34.1

0010-Reinitialize-StrategyControl-after-resizing-20250610.patch
From 0f7d58f7386d0e3a55bdcf931492b62ee883a998 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Tue, 10 Jun 2025 11:00:36 +0530
Subject: [PATCH 10/17] Reinitialize StrategyControl after resizing buffers

This commit introduces a separate function StrategyReInitialize() instead
of reusing StrategyInitialize(), since some of the things that the latter
does are not required in the former. Here's a list of what
StrategyReInitialize() does and how it differs from
StrategyInitialize().

1. When expanding the buffer pool, add new buffers to the free list.
2. When shrinking buffers, we remove any buffers in the area being
   shrunk from the freelist. While doing so we adjust the first and
   last free buffer pointers in the StrategyControl area, so nothing
   more is needed after resizing.
3. A sanity check of the free buffer list is added after resizing.
4. The StrategyControl pointer needn't be fetched again since it should
   not change, but a check is added to make sure the pointer is valid.
5. &StrategyControl->buffer_strategy_lock need not be initialized again.
6. completePasses and numBufferAllocs need not be cleared since the
   server is still running and the previous statistics are still valid.

TODO: Since StrategyControl plays a crucial role in the background writer
as well as the clock-sweep algorithm, the impact of resizing the buffers on
those two needs to be assessed. We may require more adjustments to those
two, as well as to StrategyControl, based on the assessment.
---
 src/backend/storage/buffer/buf_init.c |  11 ++-
 src/backend/storage/buffer/freelist.c | 120 ++++++++++++++++++++++++++
 src/include/storage/buf_internals.h   |   1 +
 3 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index f78be4700df..7b8bc577bd5 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -163,8 +163,15 @@ BufferManagerShmemInit(int FirstBufferToInit)
 	if (FirstBufferToInit < NBuffers)
 		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
-	/* Init other shared buffer-management stuff */
-	StrategyInitialize(!foundDescs);
+	/*
+	 * Init other shared buffer-management stuff from scratch configuring buffer
+	 * pool the first time. If we are just resizing buffer pool adjust only the
+	 * required structures.
+	 */
+	if (FirstBufferToInit == 0)
+		StrategyInitialize(!foundDescs);
+	else
+		StrategyReInitialize(FirstBufferToInit);
 
 	/* Initialize per-backend file flush context */
 	WritebackContextInit(&BackendWritebackContext,
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e384e46c779..c5277e090a5 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -98,6 +98,9 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
 									 uint32 *buf_state);
 static void AddBufferToRing(BufferAccessStrategy strategy,
 							BufferDesc *buf);
+#ifdef USE_ASSERT_CHECKING
+static void StrategyValidateFreeList(void);
+#endif /* USE_ASSERT_CHECKING */
 
 /*
  * ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -526,6 +529,75 @@ StrategyInitialize(bool init)
 		Assert(!init);
 }
 
+/*
+ * StrategyReInitialize -- re-initialize the buffer cache replacement
+ *		strategy.
+ *
+ * To be called when resizing the buffer manager, and only from the coordinator.
+ * TODO: Assess the differences between this function and StrategyInitialize().
+ */
+void
+StrategyReInitialize(int FirstBufferIdToInit)
+{
+	bool		found;
+
+	/*
+	 * Resizing memory for buffer pools should not affect the address of
+	 * StrategyControl.
+	 */
+	if (StrategyControl != (BufferStrategyControl *)
+		ShmemInitStructInSegment("Buffer Strategy Status",
+						sizeof(BufferStrategyControl),
+						&found, STRATEGY_SHMEM_SEGMENT))
+		elog(FATAL, "something went wrong while re-initializing the buffer strategy");
+
+	/* TODO: Buffer lookup table adjustment: There are two options:
+	 *
+	 * 1. Resize the buffer lookup table to match the new number of buffers. But
+	 * this requires rehashing all the entries in the buffer lookup table with
+	 * the new table size.
+	 *
+	 * 2. Allocate the maximum size of the buffer lookup table at the beginning
+	 * and never resize it. This leaves a sparse buffer lookup table, which is
+	 * inefficient from both a memory and a time perspective. According to David
+	 * Rowley, the sparse entries in the buffer lookup table cause frequent
+	 * cacheline reloads, which affect performance. If the impact of that
+	 * inefficiency in a benchmark is significant, we will need to consider the
+	 * first option.
+	 */
+
+	/*
+	 * When shrinking buffers, we must have adjusted the first and the last free
+	 * buffer when removing the buffers being shrunk from the free list. Nothing
+	 * to be done here.
+	 *
+	 * When expanding the shared buffers, new buffers are added at the end of the
+	 * freelist or they form the new free list if there are no free buffers.
+	 */
+	if (FirstBufferIdToInit < NBuffers)
+	{
+		if (StrategyControl->firstFreeBuffer == FREENEXT_END_OF_LIST)
+			StrategyControl->firstFreeBuffer = FirstBufferIdToInit;
+		else
+		{
+			Assert(StrategyControl->lastFreeBuffer >= 0);
+			GetBufferDescriptor(StrategyControl->lastFreeBuffer)->freeNext = FirstBufferIdToInit;
+		}
+
+		StrategyControl->lastFreeBuffer = NBuffers - 1;
+	}
+
+	/* Check free list sanity after resizing. */
+#ifdef USE_ASSERT_CHECKING
+	StrategyValidateFreeList();
+#endif /* USE_ASSERT_CHECKING */
+
+	/* Initialize the clock sweep pointer */
+	pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+
+	/* No pending notification */
+	StrategyControl->bgwprocno = -1;
+}
 
 /*
  * StrategyPurgeFreeList -- remove all buffers with id higher than the number of
@@ -595,6 +667,54 @@ StrategyPurgeFreeList(int numBuffers)
 	 */
 }
 
+#ifdef USE_ASSERT_CHECKING
+/*
+ * StrategyValidateFreeList -- check the sanity of the free buffer list.
+ */
+static void
+StrategyValidateFreeList(void)
+{
+	int			nextFree = StrategyControl->firstFreeBuffer;
+	int			numFreeBuffers = 0;
+	int			lastFreeBuffer = FREENEXT_END_OF_LIST;
+
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+	while (nextFree != FREENEXT_END_OF_LIST)
+	{
+		BufferDesc *buf = GetBufferDescriptor(nextFree);
+
+		/* nextFree should be id of buffer being examined. */
+		Assert(nextFree == buf->buf_id);
+		Assert(buf->buf_id < NBuffers);
+		/* The buffer should not be marked as not in the list. */
+		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+		/* Update our knowledge of last buffer in the free list. */
+		lastFreeBuffer = buf->buf_id;
+
+		numFreeBuffers++;
+
+		/* Avoid an infinite loop in case there are cycles in the free list. */
+		if (numFreeBuffers > NBuffers)
+			break;
+
+		nextFree = buf->freeNext;
+	}
+
+	Assert(numFreeBuffers <= NBuffers);
+
+	/*
+	 * Make sure that the StrategyControl's knowledge of last free buffer
+	 * agrees with what's there in the free list.
+	 */
+	if (StrategyControl->firstFreeBuffer != FREENEXT_END_OF_LIST)
+		Assert(StrategyControl->lastFreeBuffer == lastFreeBuffer);
+
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+}
+#endif /* USE_ASSERT_CHECKING */
+
 /* ----------------------------------------------------------------
  *				Backend-private buffer ring management
  * ----------------------------------------------------------------
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index add15e3723b..46949e9d90e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -454,6 +454,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
 extern void StrategyPurgeFreeList(int numBuffers);
+extern void StrategyReInitialize(int FirstBufferToInit);
 extern bool have_free_buffer(void);
 
 /* buf_table.c */
-- 
2.34.1

0012-Fix-compilation-failure-in-pg_get_shmem_pag-20250610.patch
From bf5b3cca2e8b26ce6ecf634905074ae930e3745a Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 5 Jun 2025 14:42:53 +0530
Subject: [PATCH 12/17] Fix compilation failure in pg_get_shmem_pagesize()

Fix compilation failure in pg_get_shmem_pagesize() due to incorrect call to
GetHugePageSize(). This is a temporary fix to allow compilation to proceed.

Ashutosh Bapat
---
 src/backend/storage/ipc/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index b411fbce37e..4c2bddfe6ca 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -826,7 +826,7 @@ pg_get_shmem_pagesize(void)
 	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
 
 	if (huge_pages_status == HUGE_PAGES_ON)
-		GetHugePageSize(&os_page_size, NULL);
+		GetHugePageSize(&os_page_size, NULL, NULL);
 
 	return os_page_size;
 }
-- 
2.34.1

0011-Fix-compilation-failure-in-pg_get_shmem_all-20250610.patch
From f079821259a3841637b794d2d8ad1153e14eb4b3 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 5 Jun 2025 11:34:12 +0530
Subject: [PATCH 11/17] Fix compilation failure in
 pg_get_shmem_allocations_numa()

The compilation failure is caused by
5cefa489760e34d947dbe67b4a922468b2e43668. Ideal fix should be compute
the total page count across all the shared memory segments. This commit
just fixes the compilation failure.  # Please enter the commit message
for your changes. Lines starting

Ashutosh Bapat
---
 src/backend/storage/ipc/shmem.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c07572d6f89..b411fbce37e 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -696,7 +696,12 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 	 * this is not very likely, and moreover we have more entries, each of
 	 * them using only fraction of the total pages.
 	 */
-	shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
+	/*
+	 * TODO: We should loop through all the Shm segments, instead of just the
+	 * main segment, to find the total page count.
+	 */
+	shm_total_page_count =
+		(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize / os_page_size) + 1;
 	page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
 	pages_status = palloc(sizeof(int) * shm_total_page_count);
 
-- 
2.34.1

#74Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Ashutosh Bapat (#73)
13 attachment(s)
Re: Changing shared_buffers without restart

On Tue, Jun 10, 2025 at 4:39 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Here's the patchset rebased on f85f6ab051b7cf6950247e5fa6072c4130613555
with some more fixes, as described below.

0001 - 0008 are the same as in the previous patchset.

0009 adds support for shrinking shared buffers. It has two changes: a. evict the buffers outside the new buffer size, and b. remove buffers whose buffer id is outside the new buffer size from the free list. If a buffer being evicted is pinned, the operation is aborted and a FATAL error is raised. I think we need to change this behaviour to something less severe, like rolling back the operation or waiting for the pinned buffer to be unpinned; better yet, we could let users control the behaviour. But we need better infrastructure to do such things. That's one TODO left in the patch.

Patches up to 0009 are the same as in the previous patch set.

0010 is about reinitializing the buffer replacement strategy. Once we expand the buffers, the new buffers need to be added to the free list, and some StrategyControl members (not all) need to be adjusted. That's what this patch does. But a deeper adjustment in BgBufferSync() and ClockSweepTick() is required. Further, we need to do something about the buffer lookup table; more on that later in the email.

0010 is improved with fixes for the background writer and ClockSweepTick().
Now we just reset the information saved between calls to BgBufferSync()
since it doesn't make sense after NBuffers has changed. Also the
members in StrategyControl related to ClockSweepTick() are reset for the
same reason. More details in the commit message.
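
To make that concrete, the reset could look roughly like the following inside
BgBufferSync() (purely illustrative, not necessarily how 0010 does it;
last_resize_generation is a hypothetical static, while saved_info_valid is the
existing flag guarding the state BgBufferSync() keeps across calls):

	static uint64 last_resize_generation = 0;
	uint64		resize_generation = pg_atomic_read_u64(&ShmemCtrl->Generation);

	/* Forget state saved across calls once a buffer pool resize has happened */
	if (resize_generation != last_resize_generation)
	{
		last_resize_generation = resize_generation;
		saved_info_valid = false;
	}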

0011: GetBufferFromRing() invalidates ring entries that point beyond
NBuffers, since those may have been added to a strategy ring before
resizing and are not valid anymore. Details are in the commit message.
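
For readers following along, the idea is roughly the following (a sketch
against the existing GetBufferFromRing() in freelist.c; the exact shape of
the check in the posted 0011 patch may differ):

	bufnum = strategy->buffers[strategy->current];

	/*
	 * Ring entries filled before the buffer pool was shrunk can point past
	 * the new NBuffers; treat them as empty slots so the caller allocates a
	 * fresh buffer with the normal strategy.
	 */
	if (bufnum != InvalidBuffer && bufnum > NBuffers)
	{
		strategy->buffers[strategy->current] = InvalidBuffer;
		bufnum = InvalidBuffer;
	}

	if (bufnum == InvalidBuffer)
		return NULL;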

0011-0012 fix compilation issues in these patches, but those fixes are not correct. The patches are there so that the binaries can be built without compilation issues and someone can experiment with buffer resizing. The good thing is that the compilation fixes are in the SQL-callable functions pg_get_shmem_pagesize() and pg_get_shmem_numa(), so there's no ill effect from these patches as long as those two functions are not called.

These patches are now 0012 and 0013 respectively.

Buffer lookup table resizing
------------------------------------
The size of the buffer lookup table depends upon (number of shared buffers + number of partitions in the shared buffer lookup table). If we shrink the buffer pool, the buffer lookup table becomes sparse but remains usable. If we expand the buffers, we need to expand the buffer lookup table too; that's not implemented in the current patchset. There are two possible solutions:

1. We map a lot of extra address space (not memory) initially to accommodate future expansion of the shared buffer pool. Let's say that the total address space is sufficient to accommodate Nx buffers. A simple solution is to allocate a buffer lookup table with Nx initial entries so that we never have to resize the buffer lookup table (a rough sketch follows below). It will waste memory, but we might be OK with that as a version-1 solution. According to my offline discussion with David Rowley, buffer lookups in sparse hash tables are inefficient because of more cacheline faults. Whether that translates to any noticeable performance degradation in TPS needs to be measured.

2. The alternative solution is to resize the buffer mapping table as well. This means that we rehash all the entries again, which may take a long time, and the partitions will remain locked for that duration. Not to mention this will require a non-trivial change to the dynahash implementation.

I haven't spent time on this yet.
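
For illustration, option 1 would amount to something like the following in
freelist.c (a rough sketch only, not part of the posted patches; MaxNBuffers
stands for a hypothetical upper bound on NBuffers, e.g. derived from
MaxAvailableMemory, and its name and derivation are open questions):

	/* In StrategyShmemSize(): reserve the lookup table for the maximum pool size */
	size = add_size(size, BufTableShmemSize(MaxNBuffers + NUM_BUFFER_PARTITIONS));

	/* In StrategyInitialize(): create it at that size, so it never needs resizing */
	InitBufTable(MaxNBuffers + NUM_BUFFER_PARTITIONS);

The trade-off is the shared memory reserved for entries that may never be
used, plus the sparser table mentioned above, which is why benchmarking the
cacheline behaviour seems necessary before settling on this.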

--
Best Wishes,
Ashutosh Bapat

Attachments:

0001-Allow-to-use-multiple-shared-memory-mapping-20250616.patch
From 25a501f17a36523be0b133f992393433428d73c5 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 28 Feb 2025 19:54:47 +0100
Subject: [PATCH 01/17] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways the shared memory can be organized.

Introduce the possibility of allocating multiple shared memory mappings,
where a single mapping is associated with a specified shared memory segment.
There is only a fixed number of available segments; currently only one
main shared memory segment is allocated. A new shared memory API is
introduced, extended with a segment as a new parameter. As the path of
least resistance, the original API is kept in place, utilizing the main
shared memory segment.
---
 src/backend/port/posix_sema.c     |   4 +-
 src/backend/port/sysv_sema.c      |   4 +-
 src/backend/port/sysv_shmem.c     | 138 ++++++++++++++++++++---------
 src/backend/port/win32_sema.c     |   2 +-
 src/backend/storage/ipc/ipc.c     |   4 +-
 src/backend/storage/ipc/ipci.c    |  63 +++++++------
 src/backend/storage/ipc/shmem.c   | 141 +++++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c |  13 ++-
 src/include/storage/ipc.h         |   2 +-
 src/include/storage/pg_sema.h     |   2 +-
 src/include/storage/pg_shmem.h    |  18 ++++
 src/include/storage/shmem.h       |  12 +++
 12 files changed, 278 insertions(+), 125 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..401e1113fa1 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 #endif
 
 	numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 423b2b4f9d6..4ce2cfb662b 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -307,7 +307,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -328,7 +328,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 	numSharedSemas = 0;
 	maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..56af0231d24 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+	int shmem_segment;
+	Size shmem_size; 			/* Size of the mapping */
+	Pointer shmem; 				/* Pointer to the start of the mapped memory */
+	Pointer seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping segments */
+static int next_free_segment = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_segment)
+{
+	switch (shmem_segment)
+	{
+		case MAIN_SHMEM_SEGMENT:
+			return "main";
+		default:
+			return "unknown";
+	}
+}
+
+static void
+DebugMappings(void)
+{
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		AnonymousMapping m = Mappings[i];
+		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+			 MappingName(i), m.shmem, m.shmem_size);
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-	Size		allocsize = *size;
+	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
 				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		{
+			DebugMappings();
+			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
+				 MappingName(mapping->shmem_segment), allocsize);
+		}
 	}
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = *size;
+		allocsize = mapping->shmem_size;
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
 	if (ptr == MAP_FAILED)
 	{
 		errno = mmap_errno;
+		DebugMappings();
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
-	*size = allocsize;
-	return ptr;
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		AnonymousMapping m = Mappings[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	mapping->shmem_size = size;
+	mapping->shmem_segment = next_free_segment;
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping);
+
+		next_free_segment++;
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + next_free_segment;
 
 	for (;;)
 	{
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	mapping->seg_addr = memAddress;
+	mapping->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (mapping->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		AnonymousMapping m = Mappings[i];
+
+		if (m.seg_addr != NULL)
+		{
+			if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+			m.seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index 5854ad1f54d..e7365ff8060 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 567739b5be9..5b55bec8d9d 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -61,6 +61,8 @@ static void proc_exit_prepare(int code);
  * but provide some additional features we need --- in particular,
  * we want to register callbacks to invoke when we are disconnecting
  * from a broken shared-memory context but not exiting the postmaster.
+ * The maximum number of such exit callbacks depends on the number of
+ * shared segments.
  *
  * Callback functions can take zero, one, or two args: the first passed
  * arg is the integer exitcode, the second is the Datum supplied when
@@ -68,7 +70,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..8b38e985327 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -86,7 +86,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_segment)
 {
 	Size		size;
 	int			numSemas;
@@ -206,33 +206,38 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize(&numSemas);
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
-
-	/*
-	 * Create semaphores
-	 */
-	PGReserveSemaphores(numSemas);
-
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		/* Compute the size of the shared-memory block */
+		size = CalculateShmemSize(&numSemas, segment);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+		/*
+		 * Create the shmem segment.
+		 *
+		 * XXX: Are multiple shims needed, one per segment?
+		 */
+		seghdr = PGSharedMemoryCreate(size, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSegment(seghdr, segment);
+
+		/*
+		 * Create semaphores
+		 */
+		PGReserveSemaphores(numSemas, segment);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSegment(segment);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -363,7 +368,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas);
+	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c9ae3b45b76..7e1a9b43fae 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -76,19 +76,19 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
+								 int shmem_segment);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
+ShmemSegment Segments[ANON_MAPPINGS];
 
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/*
+ * Primary index hashtable for shmem; for simplicity we use a single one for
+ * all shared memory segments. There can be performance consequences of that;
+ * an alternative option would be to have one index per shared memory segment.
+ */
+static HTAB *ShmemIndex = NULL;
 
 /* To get reliable results for NUMA inquiry we need to "touch pages" once */
 static bool firstNumaTouch = true;
@@ -101,9 +101,17 @@ Datum		pg_numa_available(PG_FUNCTION_ARGS);
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-	ShmemSegHdr = seghdr;
-	ShmemBase = seghdr;
-	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+	InitShmemAccessInSegment(seghdr, MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAccessInSegment(PGShmemHeader *seghdr, int shmem_segment)
+{
+	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	ShmemSegment *seg = &Segments[shmem_segment];
+	seg->ShmemSegHdr = shmhdr;
+	seg->ShmemBase = (void *) shmhdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -114,7 +122,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSegment(MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAllocationInSegment(int shmem_segment)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_segment].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -123,9 +137,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_segment].ShmemLock = (slock_t *) ShmemAllocUnlockedInSegment(sizeof(slock_t), shmem_segment);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_segment].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -150,11 +164,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocInSegment(Size size, int shmem_segment)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -184,6 +204,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSegment(size, allocated_size, MAIN_SHMEM_SEGMENT);
+}
+
+static void *
+ShmemAllocRawInSegment(Size size, Size *allocated_size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -203,22 +229,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= Segments[shmem_segment].ShmemSegHdr->totalsize)
 	{
-		newSpace = (char *) ShmemBase + newStart;
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
+		Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -236,6 +262,12 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+	return ShmemAllocUnlockedInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -246,19 +278,19 @@ ShmemAllocUnlocked(Size size)
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory (%zu bytes requested)",
 						size)));
-	ShmemSegHdr->freeoffset = newFree;
+	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (char *) ShmemBase + newStart;
+	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -273,7 +305,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSegment(addr, MAIN_SHMEM_SEGMENT);
+}
+
+bool
+ShmemAddrIsValidInSegment(const void *addr, int shmem_segment)
+{
+	return (addr >= Segments[shmem_segment].ShmemBase) && (addr < Segments[shmem_segment].ShmemEnd);
 }
 
 /*
@@ -334,6 +372,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSegment(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SEGMENT);
+}
+
+HTAB *
+ShmemInitHashInSegment(const char *name,		/* table string name for shmem index */
+			  long init_size,		/* initial table size */
+			  long max_size,		/* max size of the table */
+			  HASHCTL *infoP,		/* info about key and bucket size */
+			  int hash_flags,		/* info about infoP */
+			  int shmem_segment) 	/* in which segment to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -350,9 +400,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSegment(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_segment);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -385,6 +435,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSegment(name, size, foundPtr, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
+					  int shmem_segment)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -393,7 +450,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_segment].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -416,7 +473,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSegment(size, shmem_segment);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -433,8 +490,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in segment %d",
+						name, shmem_segment)));
 	}
 
 	if (*foundPtr)
@@ -459,7 +516,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -478,14 +535,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -542,10 +598,11 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
+	/* XXX: take all shared memory segments into account. */
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
 		values[2] = Int64GetDatum(ent->size);
 		values[3] = Int64GetDatum(ent->allocated_size);
 		named_allocated += ent->allocated_size;
@@ -557,15 +614,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	/* output shared memory allocated but not counted via the shmem index */
 	values[0] = CStringGetTextDatum("<anonymous>");
 	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 	/* output as-of-yet unused shared memory */
 	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46f44bc4511..a36b08895c8 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -80,6 +80,8 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
+#include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/procnumber.h"
@@ -618,10 +620,15 @@ LWLockNewTrancheId(void)
 	int		   *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
-	/* We use the ShmemLock spinlock to protect LWLockCounter */
-	SpinLockAcquire(ShmemLock);
+	/*
+	 * We use the ShmemLock spinlock to protect LWLockCounter.
+	 *
+	 * XXX: Looks like this is the only use of Segments outside of shmem.c;
+	 * it may be worth reshaping this part to hide the Segments structure.
+	 */
+	SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 	result = (*LWLockCounter)++;
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	return result;
 }
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 3baf418b3d1..6ebda479ced 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index fa6ca35a51f..8ae9637fcd0 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_segment);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 5f7d4b83a60..2348c59b5a0 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+} ShmemSegment;
+
+/* Number of available segments for anonymous memory mappings */
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -91,4 +106,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main segment, contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SEGMENT 0
+
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index c1f668ded95..69663d412c3 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -29,15 +29,27 @@
 extern PGDLLIMPORT slock_t *ShmemLock;
 struct PGShmemHeader;			/* avoid including storage/pg_shmem.h here */
 extern void InitShmemAccess(struct PGShmemHeader *seghdr);
+extern void InitShmemAccessInSegment(struct PGShmemHeader *seghdr,
+									 int shmem_segment);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSegment(int shmem_segment);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSegment(Size size, int shmem_segment);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSegment(Size size, int shmem_segment);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSegment(const void *addr, int shmem_segment);
 extern void InitShmemIndex(void);
+extern void InitVariableShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
+									long max_size, HASHCTL *infoP,
+									int hash_flags, int shmem_segment);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSegment(const char *name, Size size,
+									  bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 

base-commit: 3feff3916ee106c084eca848527dc2d2c3ef4e89
-- 
2.34.1

0002-Address-space-reservation-for-shared-memory-20250616.patch
From acb2308031b68862a2f1238bf7ac803210063f71 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Wed, 16 Oct 2024 20:21:33 +0200
Subject: [PATCH 02/17] Address space reservation for shared memory

Currently the kernel is responsible for choosing an address at which to place
each shared memory mapping, which is the lowest possible address that does not
clash with any other mapping. This is considered the most portable approach,
but one of the downsides is that there is no room left to resize allocated
mappings. Here is how it looks for one mapping in /proc/$PID/maps, where
/dev/zero represents the anonymous shared memory we are talking about:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
    ...
    7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
    7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)

By specifying the mapping address directly it's possible to place the
mapping in a way that leaves room for resizing. The idea is:

* Reserve some address space by mmap'ing a large chunk of memory with
  PROT_NONE and MAP_NORESERVE. This prepares a playground for laying
  out the shared memory without risking interference from other
  mappings.

* Slice the reserved space up into sections, one for each shared
  segment.

* Allocate shared memory segments out of the corresponding slices,
  leaving unclaimed space in between them. This is implemented by
  mmap'ing memory at a specified address within the reserved space
  with MAP_FIXED.

The result looks like this:

    012d9000-0133e000         [heap]
    7f443a800000-7f444196c000 /dev/zero (deleted)
    7f444196c000-7f470a800000                     # reserved space
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2

Things like address space randomization should not be a problem in this
context, since the randomization is applied to the mmap base, which is
chosen once per process.
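
As an isolated illustration of the mechanism (not part of the patch; the
sizes and the single carved-out segment below are assumptions for the
example), the reservation and the MAP_FIXED carve-out boil down to
something like this on Linux:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        size_t  reserve_size = (size_t) 1 << 30;    /* 1 GB of address space */
        size_t  segment_size = (size_t) 128 << 20;  /* 128 MB actually usable */
        char   *base;
        void   *segment;

        /* Reserve address space only: no access, no memory accounting. */
        base = mmap(NULL, reserve_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Carve a usable shared segment out of the reservation. */
        segment = mmap(base, segment_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (segment == MAP_FAILED)
            return 1;

        /* Only the pages actually touched here become resident. */
        memset(segment, 0, segment_size);
        printf("reserved at %p, first segment at %p\n", (void *) base, segment);
        return 0;
    }

Growing a segment later then only needs address space that is already set
aside next to it, without disturbing any unrelated mappings.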

This approach also does not impact the actual memory usage as reported
by the kernel. Here is the output of /proc/$PID/status for the master
version with shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages
    // mapped in mm_struct. It corresponds to the mapped reserved space
    // and is the only number that grows with it.
    VmPeak:          2043192 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             22908 kB
    // Size of resident anonymous memory
    RssAnon:             768 kB
    // Size of resident file mappings
    RssFile:           10364 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          11776 kB

Here is the same for the patch when reserving 20GB of space:

    VmPeak:         21250648 kB
    VmRSS:             22948 kB
    RssAnon:             768 kB
    RssFile:           10404 kB
    RssShmem:          11776 kB

Cgroup v2 doesn't have any problems with this either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16.6 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    17637376 (~16.8 MB)

To control the amount of reserved space, a new GUC max_available_memory
is introduced. Ideally it should be based on the maximum available
memory, hence the name.
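
For example, with the default of 131072 blocks and the standard 8 kB block
size this reserves 131072 * 8 kB = 1 GB of address space, none of which is
backed by actual memory until segments are carved out of it.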
---
 src/backend/port/sysv_shmem.c       | 284 ++++++++++++++++++++++++----
 src/backend/port/win32_shmem.c      |   2 +-
 src/backend/storage/ipc/ipci.c      |   5 +-
 src/backend/utils/init/globals.c    |   1 +
 src/backend/utils/misc/guc_tables.c |  14 ++
 src/include/storage/pg_shmem.h      |   4 +-
 6 files changed, 271 insertions(+), 39 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 56af0231d24..a0f03ff868f 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -108,6 +108,66 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
+/*
+ * Anonymous mapping placing (/dev/zero (deleted) below) looks like this:
+ *
+ * 00400000-00490000         /path/bin/postgres
+ * ...
+ * 012d9000-0133e000         [heap]
+ * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ * 7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
+ * 7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)
+ * ...
+ *
+ * We would like to place multiple mappings in such a way, that there will be
+ * enough space between them in the address space to be able to resize up to
+ * certain size, but without counting towards the total memory consumption.
+ *
+ * To achieve that we first reserve some shared memory address space by
+ * mmap'ing a segment of MaxAvailableMemory size with PROT_NONE and
+ * MAP_NORESERVE (these flags allow to make sure this space will not be used by
+ * anything else, yet do not count against memory limits). Having the reserved
+ * space, we allocate out of it actual chunks of shared memory as usual,
+ * updating a pointer to the current available reserved space for the next
+ * allocation with the gap between segments in mind.
+ *
+ * The result would look like this:
+ *
+ * 012d9000-0133e000         [heap]
+ * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f442e010000-7f443a800000                     # reserved empty space
+ * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f444196c000-7f470a800000                     # reserved empty space
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * [...]
+ *
+ * The reserved space pointer is calculated to slice up the total reserved
+ * space into fixed fractions of address space for each segment, as specified
+ * in the SHMEM_RESIZE_RATIO array.
+ */
+static double SHMEM_RESIZE_RATIO[1] = {
+	1.0, 									/* MAIN_SHMEM_SLOT */
+};
+
+/*
+ * Offset from the beginning of the reserved space, which indicates currently
+ * available range. New shared memory segments have to be allocated at this
+ * offset related to the reserved space.
+ */
+static Size reserved_offset = 0;
+
+/*
+ * Flag telling that we have decided to use huge pages.
+ *
+ * XXX: It's possible to use GetConfigOption("huge_pages_status", false, false)
+ * instead, but it feels like an overkill.
+ */
+static bool huge_pages_on = false;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -626,39 +686,198 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
  *
  * This function will modify mapping size to the actual size of the allocation,
  * if it ends up allocating a segment that is larger than requested.
+ *
+ * Note that we do not switch from huge pages to regular pages in this
+ * function, this decision was already made in ReserveAnonymousMemory and we
+ * stick to it.
  */
 static void
-CreateAnonymousSegment(AnonymousMapping *mapping)
+CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 {
 	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
+	int			mmap_flags = PG_MMAP_FLAGS;
 
 #ifndef MAP_HUGETLB
-	/* PGSharedMemoryCreate should have dealt with this case */
-	Assert(huge_pages != HUGE_PAGES_ON);
+	/* ReserveAnonymousMemory should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
 #else
-	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	if (huge_pages_on)
 	{
-		/*
-		 * Round up the request size to a suitable large value.
-		 */
 		Size		hugepagesize;
-		int			mmap_flags;
 
+		/* Make sure nothing is messed up */
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+		/* Round up the request size to a suitable large value */
 		GetHugePageSize(&hugepagesize, &mmap_flags);
 
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
+		mmap_flags = PG_MMAP_FLAGS | mmap_flags;
+	}
+#endif
+
+	elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p",
+		 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
+
+	/*
+	 * Try to create mapping at an address out of the reserved range, which
+	 * will allow extending it later. Use reserved_offset to allocate the
+	 * segment, then update currently available reserved range.
+	 *
+	 * If the last step fails, fall back to the regular mapping creation
+	 * and signal that shared buffers could not be resized without a
+	 * restart.
+	 */
+	ptr = mmap(base + reserved_offset, allocsize, PROT_READ | PROT_WRITE,
+			   mmap_flags | MAP_FIXED, -1, 0);
+	mmap_errno = errno;
+
+	if (ptr == MAP_FAILED)
+	{
+		DebugMappings();
+		elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p failed: %m, "
+					 "fallback to the non-resizable allocation",
+			 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
+						   PG_MMAP_FLAGS, -1, 0);
+		mmap_errno = errno;
+	}
+	else
+	{
+		Size total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
+
+		reserved_offset += total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
+	}
+
+	if (ptr == MAP_FAILED)
+	{
+		errno = mmap_errno;
+		DebugMappings();
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
+				 (mmap_errno == ENOMEM) ?
+				 errhint("This error usually means that PostgreSQL's request "
+						 "for a shared memory segment exceeded available memory, "
+						 "swap space, or huge pages. To reduce the request size "
+						 "(currently %zu bytes), reduce PostgreSQL's shared "
+						 "memory usage, perhaps by reducing \"shared_buffers\" or "
+						 "\"max_connections\".",
+						 allocsize) : 0));
+	}
+
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
+}
+
+/*
+ * ReserveAnonymousMemory
+ *
+ * Reserve shared memory address space, from which shared memory segments are
+ * going to be sliced out. The goal of this exercise is to support segment
+ * resizing, for which we need a reserved space free of potential clashes
+ * with other mmap'd areas that are not under our control. Reservation is
+ * done via mmap, and will not allocate any memory until it is actually used;
+ * MAP_NORESERVE keeps it from counting against kernel reservation limits
+ * (e.g. in cgroups or for huge pages). Do not get confused by MAP_NORESERVE
+ * -- we need to reserve some address space, but not the actual memory, and
+ * that is what this flag is about.
+ *
+ * Note that with MAP_NORESERVE a reservation with hugetlb will succeed even
+ * if there are actually not enough huge pages. Hence this function is
+ * responsible for deciding whether to use huge pages or not. To achieve that
+ * we need to probe first and try to allocate needed memory for all segments --
+ * if this succeeds, we unmap the probe segment and use hugetlb; if it fails,
+ * we proceed with the regular memory.
+ */
+void *
+ReserveAnonymousMemory(Size reserve_size)
+{
+	Size		allocsize = reserve_size;
+	void	   *ptr = MAP_FAILED;
+	int			mmap_errno = 0;
+
+	/* Complain if hugepages demanded but we can't possibly support them */
+#if !defined(MAP_HUGETLB)
+	if (huge_pages == HUGE_PAGES_ON)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("huge pages not supported on this platform")));
+#else
+	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	{
+		Size		hugepagesize, total_size = 0;
+		int			mmap_flags;
+
+		GetHugePageSize(&hugepagesize, &mmap_flags);
+
+		/*
+		 * Figure out how much memory is needed for all segments, keeping in
+		 * mind that for every segment this value will be rounded up to the
+		 * huge page size. The resulting value will be used to probe memory and
+		 * decide whether we will allocate huge pages or not.
+		 *
+		 * We could actually have a mix and match of segments with and without
+		 * huge pages. But in that case we need to have multiple reservation
+		 * spaces to use corresponding memory (hugetlb address space reserved
+		 * for hugetlb segments, regular memory for others), and it doesn't
+		 * seem worth the complexity for now.
+		 */
+		for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+		{
+			int	numSemas;
+			Size segment_size = CalculateShmemSize(&numSemas, segment);
+
+			if (segment_size % hugepagesize != 0)
+				segment_size += hugepagesize - (segment_size % hugepagesize);
+
+			total_size += segment_size;
+		}
+
+		/* Map total amount of memory to test its availability. */
+		elog(DEBUG1, "reserving space: probe mmap(%zu) with MAP_HUGETLB",
+					 total_size);
+		ptr = mmap(NULL, total_size, PROT_NONE,
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
 		{
-			DebugMappings();
-			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 MappingName(mapping->shmem_segment), allocsize);
+			/* No huge pages, we will go with the regular page size */
+			elog(DEBUG1, "reserving space: probe mmap(%zu) with MAP_HUGETLB "
+						 "failed, huge pages disabled: %m", total_size);
+		}
+		else
+		{
+			/*
+			 * All fine, unmap the temporary segment and proceed with reserving
+			 * using huge pages.
+			 */
+			if (munmap(ptr, total_size) < 0)
+				elog(LOG, "reservice space: munmap(%p, %zu) failed: %m",
+					 ptr, total_size);
+
+			/* Round up the requested size to a suitable large value. */
+			if (allocsize % hugepagesize != 0)
+				allocsize += hugepagesize - (allocsize % hugepagesize);
+
+			elog(DEBUG1, "reserving space: mmap(%zu) with MAP_HUGETLB",
+						 allocsize);
+			ptr = mmap(NULL, allocsize, PROT_NONE,
+					   PG_MMAP_FLAGS | MAP_ANONYMOUS | MAP_NORESERVE | mmap_flags,
+					   -1, 0);
+			mmap_errno = errno;
+
+			/* This should not happen, but handle errors anyway */
+			if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
+			{
+				elog(DEBUG1, "reserving space: mmap(%zu) with MAP_HUGETLB "
+							 "failed, huge pages disabled: %m", allocsize);
+			}
 		}
 	}
 #endif
@@ -666,10 +885,12 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 	/*
 	 * Report whether huge pages are in use.  This needs to be tracked before
 	 * the second mmap() call if attempting to use huge pages failed
-	 * previously.
+	 * previously. At this point ptr is either pointing to the probe segment,
+	 * if we couldn't mmap it, or the reservation space.
 	 */
 	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
 					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	huge_pages_on = ptr != MAP_FAILED;
 
 	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
 	{
@@ -677,10 +898,11 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = mapping->shmem_size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
-		mmap_errno = errno;
+		allocsize = reserve_size;
+
+		elog(DEBUG1, "reserving space: mmap(%zu)", allocsize);
+		ptr = mmap(NULL, allocsize, PROT_NONE,
+				   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
 	}
 
 	if (ptr == MAP_FAILED)
@@ -688,20 +910,18 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		errno = mmap_errno;
 		DebugMappings();
 		ereport(FATAL,
-				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
-						MappingName(mapping->shmem_segment)),
+				(errmsg("reserving space: could not map anonymous shared "
+						"memory: %m"),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
-						 "for a shared memory segment exceeded available memory, "
-						 "swap space, or huge pages. To reduce the request size "
-						 "(currently %zu bytes), reduce PostgreSQL's shared "
-						 "memory usage, perhaps by reducing \"shared_buffers\" or "
-						 "\"max_connections\".",
+						 "for a reserved shared memory address space exceeded "
+						 "available memory, swap space, or huge pages. To "
+						 "reduce the request reservation size (currently %zu "
+						 "bytes), reduce PostgreSQL's \"maximum_shared_buffers\".",
 						 allocsize) : 0));
 	}
 
-	mapping->shmem = ptr;
-	mapping->shmem_size = allocsize;
+	return ptr;
 }
 
 /*
@@ -740,7 +960,7 @@ AnonymousShmemDetach(int status, Datum arg)
  */
 PGShmemHeader *
 PGSharedMemoryCreate(Size size,
-					 PGShmemHeader **shim)
+					 PGShmemHeader **shim, Pointer base)
 {
 	IpcMemoryKey NextShmemSegID;
 	void	   *memAddress;
@@ -760,14 +980,6 @@ PGSharedMemoryCreate(Size size,
 				 errmsg("could not stat data directory \"%s\": %m",
 						DataDir)));
 
-	/* Complain if hugepages demanded but we can't possibly support them */
-#if !defined(MAP_HUGETLB)
-	if (huge_pages == HUGE_PAGES_ON)
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("huge pages not supported on this platform")));
-#endif
-
 	/* For now, we don't support huge pages in SysV memory */
 	if (huge_pages == HUGE_PAGES_ON && shared_memory_type != SHMEM_TYPE_MMAP)
 		ereport(ERROR,
@@ -782,7 +994,7 @@ PGSharedMemoryCreate(Size size,
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
 		/* On success, mapping data will be modified. */
-		CreateAnonymousSegment(mapping);
+		CreateAnonymousSegment(mapping, base);
 
 		next_free_segment++;
 
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 4dee856d6bd..ce719f1b412 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -205,7 +205,7 @@ EnableLockPagesPrivilege(int elevel)
  */
 PGShmemHeader *
 PGSharedMemoryCreate(Size size,
-					 PGShmemHeader **shim)
+					 PGShmemHeader **shim, Pointer base)
 {
 	void	   *memAddress;
 	PGShmemHeader *hdr;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8b38e985327..076888c0172 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -203,9 +203,12 @@ CreateSharedMemoryAndSemaphores(void)
 	PGShmemHeader *seghdr;
 	Size		size;
 	int			numSemas;
+	void 		*base;
 
 	Assert(!IsUnderPostmaster);
 
+	base = ReserveAnonymousMemory((Size) MaxAvailableMemory * BLCKSZ);
+
 	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
 	{
 		/* Compute the size of the shared-memory block */
@@ -217,7 +220,7 @@ CreateSharedMemoryAndSemaphores(void)
 		 *
 		 * XXX: Are multiple shims needed, one per segment?
 		 */
-		seghdr = PGSharedMemoryCreate(size, &shim);
+		seghdr = PGSharedMemoryCreate(size, &shim, base);
 
 		/*
 		 * Make sure that huge pages are never reported as "unknown" while the
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..9ccb7d455d6 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -140,6 +140,7 @@ int			max_parallel_maintenance_workers = 2;
  * register background workers.
  */
 int			NBuffers = 16384;
+int			MaxAvailableMemory = 131072;
 int			MaxConnections = 100;
 int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f04bfedb2fd..e63521e5a2d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2376,6 +2376,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_available_memory", PGC_SIGHUP, RESOURCES_MEM,
+			gettext_noop("Sets the upper limit for the shared_buffers value."),
+			gettext_noop("Shared memory could be resized at runtime, this "
+						 "parameters sets the upper limit for it, beyond which "
+						 "resizing would not be supported. Normally this value "
+						 "would be the same as the total available memory."),
+			GUC_UNIT_BLOCKS
+		},
+		&MaxAvailableMemory,
+		131072, 16, INT_MAX / 2,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_buffer_usage_limit", PGC_USERSET, RESOURCES_MEM,
 			gettext_noop("Sets the buffer pool size for VACUUM, ANALYZE, and autovacuum."),
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 2348c59b5a0..8cb1e159917 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
 extern PGDLLIMPORT int huge_pages_status;
+extern PGDLLIMPORT int MaxAvailableMemory;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
@@ -101,10 +102,11 @@ extern void PGSharedMemoryNoReAttach(void);
 #endif
 
 extern PGShmemHeader *PGSharedMemoryCreate(Size size,
-										   PGShmemHeader **shim);
+										   PGShmemHeader **shim, Pointer base);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
+void *ReserveAnonymousMemory(Size reserve_size);
 
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
-- 
2.34.1

0004-Introduce-pending-flag-for-GUC-assign-hooks-20250616.patch
From 1afa2048d803e0b3372de348868b26542ccfb3cd Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH 04/17] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the fact that the new value will be applied
immediately after the hook returns. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinated work
between backends to change, meaning we cannot apply the value right away.

Add a new flag "pending" for an assign hook to allow the hook indicate
exactly that. If the pending flag is set after the hook, the new value
will not be applied and it's handling becomes the hook's implementation
responsibility.

Note that this also requires changes in how GUCs are reported, but the
patch does not cover that yet.
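
As a rough sketch of how a hook might use the flag (the function name and
the resize machinery are placeholders here, not part of this patch):

    /* Hypothetical assign hook using the extended signature. */
    void
    assign_shared_buffers(int newval, void *extra, bool *pending)
    {
        /*
         * The new value cannot be applied immediately: it requires
         * coordination between backends. Mark the assignment as pending so
         * that guc.c leaves the variable untouched; applying the value
         * later is the hook's own responsibility.
         */
        *pending = true;

        /* ...initiate the coordinated resize here... */
    }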
---
 src/backend/access/transam/xlog.c    |  2 +-
 src/backend/commands/variable.c      |  6 +--
 src/backend/libpq/pqcomm.c           |  8 ++--
 src/backend/tcop/postgres.c          |  2 +-
 src/backend/utils/misc/guc.c         | 59 +++++++++++++++++++---------
 src/backend/utils/misc/stack_depth.c |  2 +-
 src/include/utils/guc.h              |  2 +-
 src/include/utils/guc_hooks.h        | 20 +++++-----
 8 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1914859b2ee..5e204341bde 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2321,7 +2321,7 @@ CalculateCheckpointSegments(void)
 }
 
 void
-assign_max_wal_size(int newval, void *extra)
+assign_max_wal_size(int newval, void *extra, bool *pending)
 {
 	max_wal_size_mb = newval;
 	CalculateCheckpointSegments();
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 608f10d9412..e40dae2ddf2 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1143,7 +1143,7 @@ check_cluster_name(char **newval, void **extra, GucSource source)
  * GUC assign_hook for maintenance_io_concurrency
  */
 void
-assign_maintenance_io_concurrency(int newval, void *extra)
+assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
 {
 	/*
 	 * Reconfigure recovery prefetching, because a setting it depends on
@@ -1161,12 +1161,12 @@ assign_maintenance_io_concurrency(int newval, void *extra)
  * they may be assigned in either order.
  */
 void
-assign_io_max_combine_limit(int newval, void *extra)
+assign_io_max_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit = Min(newval, io_combine_limit_guc);
 }
 void
-assign_io_combine_limit(int newval, void *extra)
+assign_io_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit = Min(io_max_combine_limit, newval);
 }
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index e5171467de1..2a6a587ef76 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1952,7 +1952,7 @@ pq_settcpusertimeout(int timeout, Port *port)
  * GUC assign_hook for tcp_keepalives_idle
  */
 void
-assign_tcp_keepalives_idle(int newval, void *extra)
+assign_tcp_keepalives_idle(int newval, void *extra, bool *pending)
 {
 	/*
 	 * The kernel API provides no way to test a value without setting it; and
@@ -1985,7 +1985,7 @@ show_tcp_keepalives_idle(void)
  * GUC assign_hook for tcp_keepalives_interval
  */
 void
-assign_tcp_keepalives_interval(int newval, void *extra)
+assign_tcp_keepalives_interval(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivesinterval(newval, MyProcPort);
@@ -2008,7 +2008,7 @@ show_tcp_keepalives_interval(void)
  * GUC assign_hook for tcp_keepalives_count
  */
 void
-assign_tcp_keepalives_count(int newval, void *extra)
+assign_tcp_keepalives_count(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivescount(newval, MyProcPort);
@@ -2031,7 +2031,7 @@ show_tcp_keepalives_count(void)
  * GUC assign_hook for tcp_user_timeout
  */
 void
-assign_tcp_user_timeout(int newval, void *extra)
+assign_tcp_user_timeout(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_settcpusertimeout(newval, MyProcPort);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2f8c3d5f918..0d1b6466d1e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3591,7 +3591,7 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 
 /* GUC assign hook for transaction_timeout */
 void
-assign_transaction_timeout(int newval, void *extra)
+assign_transaction_timeout(int newval, void *extra, bool *pending)
 {
 	if (IsTransactionState())
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 667df448732..bb681f5bc60 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1679,6 +1679,7 @@ InitializeOneGUCOption(struct config_generic *gconf)
 				struct config_int *conf = (struct config_int *) gconf;
 				int			newval = conf->boot_val;
 				void	   *extra = NULL;
+				bool 	   pending = false;
 
 				Assert(newval >= conf->min);
 				Assert(newval <= conf->max);
@@ -1687,9 +1688,13 @@ InitializeOneGUCOption(struct config_generic *gconf)
 					elog(FATAL, "failed to initialize %s to %d",
 						 conf->gen.name, newval);
 				if (conf->assign_hook)
-					conf->assign_hook(newval, extra);
-				*conf->variable = conf->reset_val = newval;
-				conf->gen.extra = conf->reset_extra = extra;
+					conf->assign_hook(newval, extra, &pending);
+
+				if (!pending)
+				{
+					*conf->variable = conf->reset_val = newval;
+					conf->gen.extra = conf->reset_extra = extra;
+				}
 				break;
 			}
 		case PGC_REAL:
@@ -2041,13 +2046,18 @@ ResetAllOptions(void)
 			case PGC_INT:
 				{
 					struct config_int *conf = (struct config_int *) gconf;
+					bool 			  pending = false;
 
 					if (conf->assign_hook)
 						conf->assign_hook(conf->reset_val,
-										  conf->reset_extra);
-					*conf->variable = conf->reset_val;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									conf->reset_extra);
+										  conf->reset_extra,
+										  &pending);
+					if (!pending)
+					{
+						*conf->variable = conf->reset_val;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										conf->reset_extra);
+					}
 					break;
 				}
 			case PGC_REAL:
@@ -2424,16 +2434,21 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
 							struct config_int *conf = (struct config_int *) gconf;
 							int			newval = newvalue.val.intval;
 							void	   *newextra = newvalue.extra;
+							bool 	    pending = false;
 
 							if (*conf->variable != newval ||
 								conf->gen.extra != newextra)
 							{
 								if (conf->assign_hook)
-									conf->assign_hook(newval, newextra);
-								*conf->variable = newval;
-								set_extra_field(&conf->gen, &conf->gen.extra,
-												newextra);
-								changed = true;
+									conf->assign_hook(newval, newextra, &pending);
+
+								if (!pending)
+								{
+									*conf->variable = newval;
+									set_extra_field(&conf->gen, &conf->gen.extra,
+													newextra);
+									changed = true;
+								}
 							}
 							break;
 						}
@@ -3850,18 +3865,24 @@ set_config_with_handle(const char *name, config_handle *handle,
 
 				if (changeVal)
 				{
+					bool pending = false;
+
 					/* Save old value to support transaction abort */
 					if (!makeDefault)
 						push_old_value(&conf->gen, action);
 
 					if (conf->assign_hook)
-						conf->assign_hook(newval, newextra);
-					*conf->variable = newval;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									newextra);
-					set_guc_source(&conf->gen, source);
-					conf->gen.scontext = context;
-					conf->gen.srole = srole;
+						conf->assign_hook(newval, newextra, &pending);
+
+					if (!pending)
+					{
+						*conf->variable = newval;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										newextra);
+						set_guc_source(&conf->gen, source);
+						conf->gen.scontext = context;
+						conf->gen.srole = srole;
+					}
 				}
 				if (makeDefault)
 				{
diff --git a/src/backend/utils/misc/stack_depth.c b/src/backend/utils/misc/stack_depth.c
index 8f7cf531fbc..ef59ae62008 100644
--- a/src/backend/utils/misc/stack_depth.c
+++ b/src/backend/utils/misc/stack_depth.c
@@ -156,7 +156,7 @@ check_max_stack_depth(int *newval, void **extra, GucSource source)
 
 /* GUC assign hook for max_stack_depth */
 void
-assign_max_stack_depth(int newval, void *extra)
+assign_max_stack_depth(int newval, void *extra, bool *pending)
 {
 	ssize_t		newval_bytes = newval * (ssize_t) 1024;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f619100467d..8802ad8a3cb 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -187,7 +187,7 @@ typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource sourc
 typedef bool (*GucEnumCheckHook) (int *newval, void **extra, GucSource source);
 
 typedef void (*GucBoolAssignHook) (bool newval, void *extra);
-typedef void (*GucIntAssignHook) (int newval, void *extra);
+typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);
 typedef void (*GucRealAssignHook) (double newval, void *extra);
 typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 typedef void (*GucEnumAssignHook) (int newval, void *extra);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 799fa7ace68..c8300cffa8e 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -81,14 +81,14 @@ extern bool check_log_stats(bool *newval, void **extra, GucSource source);
 extern bool check_log_timezone(char **newval, void **extra, GucSource source);
 extern void assign_log_timezone(const char *newval, void *extra);
 extern const char *show_log_timezone(void);
-extern void assign_maintenance_io_concurrency(int newval, void *extra);
-extern void assign_io_max_combine_limit(int newval, void *extra);
-extern void assign_io_combine_limit(int newval, void *extra);
+extern void assign_maintenance_io_concurrency(int newval, void *extra, bool *pending);
+extern void assign_io_max_combine_limit(int newval, void *extra, bool *pending);
+extern void assign_io_combine_limit(int newval, void *extra, bool *pending);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
 										 GucSource source);
-extern void assign_max_wal_size(int newval, void *extra);
+extern void assign_max_wal_size(int newval, void *extra, bool *pending);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
-extern void assign_max_stack_depth(int newval, void *extra);
+extern void assign_max_stack_depth(int newval, void *extra, bool *pending);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
@@ -143,13 +143,13 @@ extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 extern void assign_syslog_facility(int newval, void *extra);
 extern void assign_syslog_ident(const char *newval, void *extra);
-extern void assign_tcp_keepalives_count(int newval, void *extra);
+extern void assign_tcp_keepalives_count(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_count(void);
-extern void assign_tcp_keepalives_idle(int newval, void *extra);
+extern void assign_tcp_keepalives_idle(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_idle(void);
-extern void assign_tcp_keepalives_interval(int newval, void *extra);
+extern void assign_tcp_keepalives_interval(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_interval(void);
-extern void assign_tcp_user_timeout(int newval, void *extra);
+extern void assign_tcp_user_timeout(int newval, void *extra, bool *pending);
 extern const char *show_tcp_user_timeout(void);
 extern bool check_temp_buffers(int *newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra,
@@ -165,7 +165,7 @@ extern bool check_transaction_buffers(int *newval, void **extra, GucSource sourc
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
 extern bool check_transaction_read_only(bool *newval, void **extra, GucSource source);
-extern void assign_transaction_timeout(int newval, void *extra);
+extern void assign_transaction_timeout(int newval, void *extra, bool *pending);
 extern const char *show_unix_socket_permissions(void);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern bool check_wal_consistency_checking(char **newval, void **extra,
-- 
2.34.1

0005-Introduce-pss_barrierReceivedGeneration-20250616.patch
From fb470c019742f2e9eaa7666ab81a24f816066387 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH 05/17] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier makes it possible to ensure that a
message sent via EmitProcSignalBarrier has been processed by all
ProcSignal mechanism participants.

Add pss_barrierReceivedGeneration alongside pss_barrierGeneration; it is
updated when a process has received the message but not yet processed it.
This makes it possible to support a new mode of waiting, in which
ProcSignal participants synchronize message processing. To do that, a
participant can wait via WaitForProcSignalBarrierReceived when processing
a message, effectively making sure that all processes start processing
the ProcSignalBarrier simultaneously.
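
Roughly, the intended usage looks like this (PROCSIGNAL_BARRIER_RESIZE is
a placeholder barrier type used only for illustration; only the two wait
functions below are provided by this patch):

    #include "postgres.h"
    #include "storage/procsignal.h"

    static void
    initiator(void)
    {
        uint64      generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_RESIZE);

        /* Returns once every backend has fully processed the message. */
        WaitForProcSignalBarrier(generation);
    }

    static void
    participant(uint64 generation)
    {
        /* Block until every backend has at least observed the message... */
        WaitForProcSignalBarrierReceived(generation);

        /* ...so that all of them start the actual processing together. */
    }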
---
 src/backend/storage/ipc/procsignal.c | 67 ++++++++++++++++++++++------
 src/include/storage/procsignal.h     |  1 +
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index a9bb540b55a..c6bec9be423 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -58,7 +58,10 @@
  * of it. For such use cases, we set a bit in pss_barrierCheckMask and then
  * increment the current "barrier generation"; when the new barrier generation
  * (or greater) appears in the pss_barrierGeneration flag of every process,
- * we know that the message has been received everywhere.
+ * we know that the message has been received and processed everywhere. In case
+ * if we only need to know only that the message was received everywhere (e.g.
+ * receiving processes need to handle the message in a coordinated fashion)
+ * use pss_barrierReceivedGeneration in the same way.
  */
 typedef struct
 {
@@ -70,6 +73,7 @@ typedef struct
 
 	/* Barrier-related fields (not protected by pss_mutex) */
 	pg_atomic_uint64 pss_barrierGeneration;
+	pg_atomic_uint64 pss_barrierReceivedGeneration;
 	pg_atomic_uint32 pss_barrierCheckMask;
 	ConditionVariable pss_barrierCV;
 } ProcSignalSlot;
@@ -152,6 +156,8 @@ ProcSignalShmemInit(void)
 			slot->pss_cancel_key_len = 0;
 			MemSet(slot->pss_signalFlags, 0, sizeof(slot->pss_signalFlags));
 			pg_atomic_init_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+			pg_atomic_init_u64(&slot->pss_barrierReceivedGeneration,
+							   PG_UINT64_MAX);
 			pg_atomic_init_u32(&slot->pss_barrierCheckMask, 0);
 			ConditionVariableInit(&slot->pss_barrierCV);
 		}
@@ -199,6 +205,8 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	barrier_generation =
 		pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration,
+						barrier_generation);
 
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
@@ -263,6 +271,7 @@ CleanupProcSignalState(int status, Datum arg)
 	 * no barrier waits block on it.
 	 */
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration, PG_UINT64_MAX);
 
 	SpinLockRelease(&slot->pss_mutex);
 
@@ -416,12 +425,8 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
 	return generation;
 }
 
-/*
- * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
- * requested by a specific call to EmitProcSignalBarrier() have taken effect.
- */
-void
-WaitForProcSignalBarrier(uint64 generation)
+static void
+WaitForProcSignalBarrierInternal(uint64 generation, bool receivedOnly)
 {
 	Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));
 
@@ -436,12 +441,17 @@ WaitForProcSignalBarrier(uint64 generation)
 		uint64		oldval;
 
 		/*
-		 * It's important that we check only pss_barrierGeneration here and
-		 * not pss_barrierCheckMask. Bits in pss_barrierCheckMask get cleared
-		 * before the barrier is actually absorbed, but pss_barrierGeneration
+		 * It's important that we check only pss_barrierGeneration &
+		 * pss_barrierGeneration here and not pss_barrierCheckMask. Bits in
+		 * pss_barrierCheckMask get cleared before the barrier is actually
+		 * absorbed, but pss_barrierGeneration & pss_barrierReceivedGeneration
 		 * is updated only afterward.
 		 */
-		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+		if (receivedOnly)
+			oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+		else
+			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
 		while (oldval < generation)
 		{
 			if (ConditionVariableTimedSleep(&slot->pss_barrierCV,
@@ -450,7 +460,11 @@ WaitForProcSignalBarrier(uint64 generation)
 				ereport(LOG,
 						(errmsg("still waiting for backend with PID %d to accept ProcSignalBarrier",
 								(int) pg_atomic_read_u32(&slot->pss_pid))));
-			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
+			if (receivedOnly)
+				oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+			else
+				oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		}
 		ConditionVariableCancelSleep();
 	}
@@ -464,12 +478,33 @@ WaitForProcSignalBarrier(uint64 generation)
 	 * The caller is probably calling this function because it wants to read
 	 * the shared state or perform further writes to shared state once all
 	 * backends are known to have absorbed the barrier. However, the read of
-	 * pss_barrierGeneration was performed unlocked; insert a memory barrier
-	 * to separate it from whatever follows.
+	 * pss_barrierGeneration & pss_barrierReceivedGeneration was performed
+	 * unlocked; insert a memory barrier to separate it from whatever follows.
 	 */
 	pg_memory_barrier();
 }
 
+/*
+ * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
+ * requested by a specific call to EmitProcSignalBarrier() have taken effect.
+ */
+void
+WaitForProcSignalBarrier(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, false);
+}
+
+/*
+ * WaitForProcSignalBarrierReceived - wait until it is guaranteed that all
+ * backends have observed the message sent by a specific call to
+ * EmitProcSignalBarrier().
+ */
+void
+WaitForProcSignalBarrierReceived(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, true);
+}
+
 /*
  * Handle receipt of an interrupt indicating a global barrier event.
  *
@@ -523,6 +558,10 @@ ProcessProcSignalBarrier(void)
 	if (local_gen == shared_gen)
 		return;
 
+	/* The message is observed, record that */
+	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
+						shared_gen);
+
 	/*
 	 * Get and clear the flags that are set for this backend. Note that
 	 * pg_atomic_exchange_u32 is a full barrier, so we're guaranteed that the
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index afeeb1ca019..2733bbb8c5b 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -79,6 +79,7 @@ extern void SendCancelRequest(int backendPID, const uint8 *cancel_key, int cance
 
 extern uint64 EmitProcSignalBarrier(ProcSignalBarrierType type);
 extern void WaitForProcSignalBarrier(uint64 generation);
+extern void WaitForProcSignalBarrierReceived(uint64 generation);
 extern void ProcessProcSignalBarrier(void);
 
 extern void procsignal_sigusr1_handler(SIGNAL_ARGS);
-- 
2.34.1

0003-Introduce-multiple-shmem-segments-for-share-20250616.patch
From 92e4e639547207e883bc010c485177afa32d5c72 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 15 Mar 2025 16:38:59 +0100
Subject: [PATCH 03/17] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into the following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers,
meaning that if we want to change NBuffers, these structures have to be
resized correspondingly. Placing each of them in a separate shmem
segment makes that possible.

There are some assumptions made about each shmem segment's upper size
limit. The buffer blocks get the largest share, while the rest claim less
extra room for resizing. Ideally those limits should be deduced from the
maximum allowed shared memory.
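
As an illustration (using the defaults rather than anything the patch
mandates): with max_available_memory at its default of 131072 blocks,
i.e. 1 GB of reserved address space, the 0.6 ratio below sets aside
roughly 614 MB of that space for the buffer blocks segment, while the
remaining segments share the rest.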
---
 src/backend/port/sysv_shmem.c          | 24 +++++++-
 src/backend/storage/buffer/buf_init.c  | 79 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  6 +-
 src/backend/storage/buffer/freelist.c  |  5 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 24 +++++++-
 7 files changed, 105 insertions(+), 37 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index a0f03ff868f..f46d9d5d9cd 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -147,10 +147,18 @@ static int next_free_segment = 0;
  *
  * The reserved space pointer is calculated to slice up the total reserved
  * space into fixed fractions of address space for each segment, as specified
- * in the SHMEM_RESIZE_RATIO array.
+ * in the SHMEM_RESIZE_RATIO array. E.g. we allow BUFFERS_SHMEM_SEGMENT to take
+ * up to 60% of the whole space when resizing, based on the fact that it most
+ * likely will be the main consumer of this memory. Those numbers are pulled
+ * out of thin air for now; it makes sense to evaluate them more precisely.
  */
-static double SHMEM_RESIZE_RATIO[1] = {
-	1.0, 									/* MAIN_SHMEM_SLOT */
+static double SHMEM_RESIZE_RATIO[6] = {
+	0.1,    /* MAIN_SHMEM_SEGMENT */
+	0.6,    /* BUFFERS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_IOCV_SHMEM_SEGMENT */
+	0.05,   /* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
+	0.05,   /* STRATEGY_SHMEM_SEGMENT */
 };
 
 /*
@@ -182,6 +190,16 @@ MappingName(int shmem_segment)
 	{
 		case MAIN_SHMEM_SEGMENT:
 			return "main";
+		case BUFFERS_SHMEM_SEGMENT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SEGMENT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SEGMENT:
+			return "strategy";
 		default:
 			return "unknown";
 	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..bd68b69ee98 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -62,7 +62,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). Size of data structures initialized
+ * here depends on NBuffers, and to be able to change NBuffers without a
+ * restart we store each structure into a separate shared memory segment, which
+ * could be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -74,22 +77,22 @@ BufferManagerShmemInit(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSegment("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSegment("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -99,8 +102,9 @@ BufferManagerShmemInit(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -156,33 +160,54 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc., based on the
+ * given shared memory segment. The main segment must not allocate anything
+ * related to buffers; every other segment receives its part of the
+ * data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(int shmem_segment)
 {
 	Size		size = 0;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == MAIN_SHMEM_SEGMENT)
+		return size;
 
-	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
+	{
+		/* size of buffer descriptors */
+		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+		/* to allow aligning buffer descriptors */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of data pages, plus alignment padding */
+		size = add_size(size, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	}
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
-								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
+	{
+		/* size of stuff controlled by freelist.c */
+		size = add_size(size, StrategyShmemSize());
+	}
 
-	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
+	{
+		/* size of I/O condition variables */
+		size = add_size(size, mul_size(NBuffers,
+									   sizeof(ConditionVariableMinimallyPadded)));
+		/* to allow aligning the above */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
+
+	if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of checkpoint sort array in bufmgr.c */
+		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	}
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index a50955d5286..a9952b36eba 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -22,6 +22,7 @@
 #include "postgres.h"
 
 #include "storage/buf_internals.h"
+#include "storage/pg_shmem.h"
 
 /* entry for buffer lookup hashtable */
 typedef struct
@@ -59,10 +60,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION,
+								  STRATEGY_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..bd390f2709d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 
 #define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
@@ -491,9 +492,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SEGMENT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 076888c0172..9d00b80b4f8 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -113,7 +113,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize());
+	size = add_size(size, BufferManagerShmemSize(shmem_segment));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41fdc1e7693..edac9db6a12 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,7 +318,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 8cb1e159917..f8459a5a421 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -108,7 +108,29 @@ extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 void *ReserveAnonymousMemory(Size reserve_size);
 
+/*
+ * To be able to dynamically resize the largest parts of the data stored in
+ * shared memory, we split it into multiple shared memory segments. Each
+ * segment contains only a certain part of the data, whose size depends on
+ * NBuffers.
+ */
+
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SEGMENT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SEGMENT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SEGMENT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SEGMENT 5
+
 #endif							/* PG_SHMEM_H */
-- 
2.34.1

Attachment: 0008-Support-resize-for-hugetlb-20250616.patch (text/x-patch)
From 2ebc737cd5b22c4cb3fbcafb583c0bbd61fe93d0 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 5 Apr 2025 19:51:33 +0200
Subject: [PATCH 08/17] Support resize for hugetlb

The Linux kernel has a set of limitations on remapping hugetlb segments:
it can't increase the size of such a segment [1], and shrinking it will
not release the memory back. In fact, support for hugetlb mremap was
implemented only relatively recently [2].

As a workaround, avoid mremap when resizing shared memory. Instead, unmap
the whole segment and map it back at the same address with the new size,
relying on the fact that the fd for the anon file behind the segment is
still open and will keep the memory contents.

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mremap.c?id=f4d2ef48250ad057e4f00087967b5ff366da9f39#n1593
[2]: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/mremap.c?id=550a7d60bd5e35a56942dba6d8a26752beb26c9f
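
To illustrate the workaround, a minimal standalone sketch (no error
handling; the helper name is made up, and new_size is assumed to be
already rounded up to the huge page size):

    #include <sys/mman.h>
    #include <unistd.h>

    /* Resize a mapping backed by a hugetlb memfd: drop the old mapping,
     * adjust the backing file, and map it again at the same address. */
    static void *
    remap_hugetlb_segment(void *addr, int fd, size_t old_size, size_t new_size)
    {
        munmap(addr, old_size);
        ftruncate(fd, new_size);
        return mmap(addr, new_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    }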
---
 src/backend/port/sysv_shmem.c | 60 +++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 87000a24eea..f0b53ce1d7c 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1109,6 +1109,7 @@ AnonymousShmemResize(void)
 		/* Note that CalculateShmemSize indirectly depends on NBuffers */
 		Size new_size = CalculateShmemSize(&numSemas, i);
 		AnonymousMapping *m = &Mappings[i];
+		int	mmap_flags = PG_MMAP_FLAGS;
 
 		if (m->shmem == NULL)
 			continue;
@@ -1116,6 +1117,44 @@ AnonymousShmemResize(void)
 		if (m->shmem_size == new_size)
 			continue;
 
+#ifndef MAP_HUGETLB
+		/* ReserveAnonymousMemory should have dealt with this case */
+		Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
+#else
+		if (huge_pages_on)
+		{
+			Size		hugepagesize;
+
+			/* Make sure nothing is messed up */
+			Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+			/* Round up the new size to a suitable large value */
+			GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+
+			if (new_size % hugepagesize != 0)
+				new_size += hugepagesize - (new_size % hugepagesize);
+
+			mmap_flags = PG_MMAP_FLAGS | mmap_flags;
+		}
+#endif
+
+		/*
+		 * Linux limitations do not allow us to mremap hugetlb in the way we
+		 * want. E.g. no size increase is allowed, and when shrinking, the memory
+		 * will not be released back. To work around this, unmap the segment and
+		 * create a new one at the same address. Thanks to the backing anon
+		 * file, the content will still be kept in memory.
+		 */
+		elog(DEBUG1, "segment[%s]: remap from %zu to %zu at address %p",
+					 MappingName(m->shmem_segment), m->shmem_size,
+					 new_size, m->shmem);
+
+		if (munmap(m->shmem, m->shmem_size) < 0)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not unmap shared memory segment %s [%p]: %m",
+							MappingName(m->shmem_segment), m->shmem)));
+
 		/* Resize the backing anon file. */
 		if(ftruncate(m->segment_fd, new_size) == -1)
 			ereport(FATAL,
@@ -1123,25 +1162,14 @@ AnonymousShmemResize(void)
 					 errmsg("could not truncase anonymous file for \"%s\": %m",
 							MappingName(m->shmem_segment))));
 
-		/* Clean up some reserved space to resize into */
-		if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
-			ereport(FATAL,
-					(errcode(ERRCODE_SYSTEM_ERROR),
-					 errmsg("could not unmap %zu from reserved shared memory %p: %m",
-							new_size - m->shmem_size, m->shmem)));
-
-		/* Claim the unused space */
-		elog(DEBUG1, "segment[%s]: remap from %zu to %zu at address %p",
-					 MappingName(m->shmem_segment), m->shmem_size,
-					 new_size, m->shmem);
-
-		ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
+		/* Reclaim the space */
+		ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
+				   mmap_flags | MAP_FIXED, m->segment_fd, 0);
 		if (ptr == MAP_FAILED)
 			ereport(FATAL,
 					(errcode(ERRCODE_SYSTEM_ERROR),
-					 errmsg("could not resize shared memory segment %s [%p] to %d (%zu): %m",
-							MappingName(m->shmem_segment), m->shmem, NBuffers,
-							new_size)));
+					 errmsg("could not map shared memory segment %s [%p] with size %zu: %m",
+							MappingName(m->shmem_segment), m->shmem, new_size)));
 
 		reinit = true;
 		m->shmem_size = new_size;
-- 
2.34.1

Attachment: 0006-Allow-to-resize-shared-memory-without-resta-20250616.patch (text/x-patch)
From 12a39ceb67b438c2fdf3727b51f5e4a9c105b3b0 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:47:16 +0200
Subject: [PATCH 06/17] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using the
space introduced in the previous commits, without requiring a PostgreSQL
restart. Essentially the implementation is based on two mechanisms: a
ProcSignalBarrier is used to make sure all processes start the resize
procedure simultaneously, and a global Barrier is used to coordinate
after that, making sure processes that have finished wait for those
still in progress.

The resize process looks like this:

* The GUC assign hook sets a flag to let the postmaster know that a
  resize was requested.

* The postmaster checks the flag in its event loop and starts the resize
  by emitting a ProcSignal barrier.

* All processes that participate in the ProcSignal mechanism begin to
  process the ProcSignal barrier. First, each process waits until all
  processes have confirmed receiving the message, so that everyone can
  start simultaneously.

* Every process recalculates the shared memory size based on the new
  NBuffers and extends it using mremap. One elected process signals the
  postmaster to do the same.

* When finished, every process waits on a global ShmemControl barrier
  until all others are finished as well. This way we ensure three
  stages with clear boundaries: before the resize, when all processes
  use the old NBuffers; during the resize, when processes have a mix of
  old and new NBuffers and wait until it's done; after the resize, when
  all processes use the new NBuffers.

* After all processes are using the new value, one of them initializes
  the new shared structures (buffer blocks, descriptors, etc.) as needed
  and broadcasts the new value of NBuffers via ShmemControl in shared
  memory. Other backends wait for this operation to finish as well. Then
  the barrier is lifted and everything proceeds as usual (a condensed
  sketch of the per-backend handler follows).
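
Condensed from the handler added below (ProcessBarrierShmemResize), the
per-backend part of the procedure boils down to the following; error
paths, logging and asserts are omitted:

    if (!pending_pm_shmem_resize)
        return false;               /* no new NBuffers value seen yet */

    /* Gather: wait until every ProcSignal participant got the message. */
    if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
        WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));

    /* Start: one elected backend also tells the postmaster to resize. */
    if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
        SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);

    AnonymousShmemResize();         /* remap this process' segments */

    /* Done: wait for everyone else, then detach and continue as usual. */
    BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
    BarrierDetach(barrier);
    return true;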

Since resizing takes time, we need to take into account that during that time:

- New backends can be spawned. They will check the status of the barrier
  early during bootstrap and wait until everything is over to work with
  the new NBuffers value (the catch-up path is sketched after this list).

- Old backends can exit before attempting to resize. The synchronization
  used between backends relies on the ProcSignalBarrier and waits at the
  beginning until all participants have received the message, gathering
  all existing backends.

- Some backends might be blocked and not responding either before or
  after receiving the message. In the first case such a backend still
  has a ProcSignalSlot and should be waited for; in the second case the
  shared barrier will make sure we keep waiting for those backends. In
  either case there is an unbounded wait.

- Backends might join the barrier in disjoint groups with some time in
  between. That means that relying only on the shared dynamic barrier is
  not enough -- it would only synchronize the resize procedure within
  those groups. That's why we first wait for all participants of the
  ProcSignal mechanism who received the message.
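
For a newly spawned backend the catch-up path, condensed from the
changes to PostgresMain() below, is simply:

    /* Join the shared barrier if a resize is still in flight and wait it out. */
    WaitOnShmemBarrier();

    /* Compare the broadcast NSharedBuffers with our NBuffers and, if they
     * differ, remap the segments before touching the buffer pool. */
    AdjustShmemSize();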

Here is how it looks after raising shared_buffers from 128 MB to
512 MB and calling pg_reload_conf():

    -- 128 MB
    7f90cde00000-7f90d4fa6000  /dev/zero (deleted)
    7f90d4fa6000-7f914de00000
    7f914de00000-7f915cfa8000  /dev/zero (deleted)
    ^ buffers mapping, ~241 MB
    7f915cfa8000-7f944de00000
    7f944de00000-7f94550a8000  /dev/zero (deleted)
    7f94550a8000-7f94cde00000
    7f94cde00000-7f94d4fe8000  /dev/zero (deleted)
    7f94d4fe8000-7f954de00000
    7f954de00000-7f9554ff6000  /dev/zero (deleted)
    7f9554ff6000-7f958de00000
    7f958de00000-7f959508a000  /dev/zero (deleted)
    7f959508a000-7f95cde00000

    -- 512 MB
    7f90cde00000-7f90d5126000  /dev/zero (deleted)
    7f90d5126000-7f914de00000
    7f914de00000-7f9175128000  /dev/zero (deleted)
    ^ buffers mapping, ~627 MB
    7f9175128000-7f944de00000
    7f944de00000-7f9455528000  /dev/zero (deleted)
    7f9455528000-7f94cde00000
    7f94cde00000-7f94d5228000  /dev/zero (deleted)
    7f94d5228000-7f954de00000
    7f954de00000-7f9555266000  /dev/zero (deleted)
    7f9555266000-7f958de00000
    7f958de00000-7f95954aa000  /dev/zero (deleted)
    7f95954aa000-7f95cde00000

The implementation supports only increasing shared_buffers. Decreasing
the value would require a similar procedure, but the buffer blocks with
data would have to be drained first, so that the actual data set fits
into the new, smaller space.

Experiments show that shared mappings have to be extended separately in
each process that uses them. Another rough edge is that a backend
blocked on ReadCommand will not apply the shared_buffers change until it
receives something.

Note that mremap is Linux specific, so the implementation is not very
portable.

Authors: Dmitrii Dolgov, Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 413 ++++++++++++++++++
 src/backend/postmaster/postmaster.c           |  18 +
 src/backend/storage/buffer/buf_init.c         |  75 ++--
 src/backend/storage/ipc/ipci.c                |  18 +-
 src/backend/storage/ipc/procsignal.c          |  46 ++
 src/backend/storage/ipc/shmem.c               |  23 +-
 src/backend/tcop/postgres.c                   |  10 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/backend/utils/misc/guc_tables.c           |   4 +-
 src/include/miscadmin.h                       |   1 +
 src/include/storage/bufmgr.h                  |   2 +-
 src/include/storage/ipc.h                     |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pg_shmem.h                |  26 ++
 src/include/storage/pmsignal.h                |   1 +
 src/include/storage/procsignal.h              |   1 +
 src/tools/pgindent/typedefs.list              |   1 +
 17 files changed, 603 insertions(+), 43 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index f46d9d5d9cd..a3437973784 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,13 +30,19 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/pmsignal.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/pidfile.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -105,6 +111,13 @@ typedef struct AnonymousMapping
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/* Flag telling postmaster that resize is needed */
+volatile bool pending_pm_shmem_resize = false;
+
+/* Keeps track of the previous NBuffers value */
+static int NBuffersOld = -1;
+static int NBuffersPending = -1;
+
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
@@ -176,6 +189,49 @@ static Size reserved_offset = 0;
  */
 static bool huge_pages_on = false;
 
+/*
+ * Flag indicating that we have prepared the memory layout to be resizable. If
+ * it is false after all shared memory segments have been created, we failed to
+ * set up the needed layout and fell back to the regular non-resizable approach.
+ */
+static bool shmem_resizable = false;
+
+/*
+ * Currently broadcasted value of NBuffers in shared memory.
+ *
+ * Most of the time this value is going to be equal to NBuffers. But if the
+ * postmaster is resizing shared memory while a new backend is being created,
+ * there is a possibility for the new backend to inherit the old NBuffers
+ * value but miss the resize signal if the ProcSignal infrastructure was not
+ * initialized yet. Consider this situation:
+ *
+ *     Postmaster ------> New Backend
+ *         |                   |
+ *         |                Launch
+ *         |                   |
+ *         |             Inherit NBuffers
+ *         |                   |
+ *     Resize NBuffers         |
+ *         |                   |
+ *     Emit Barrier            |
+ *         |            Init ProcSignal
+ *         |                   |
+ *     Finish resize           |
+ *         |                   |
+ *     New NBuffers       Old NBuffers
+ *
+ * In this case the backend is not yet ready to receive a signal from
+ * EmitProcSignalBarrier, and will be ignored. The same happens if ProcSignal
+ * is initialized even later, after the resizing was finished.
+ *
+ * To address the resulting inconsistency, the postmaster broadcasts the current
+ * NBuffers value via shared memory. Every new backend has to verify this value
+ * before it accesses the buffer pool: if it differs from its own value, a
+ * shared memory resize has happened and the backend has to first synchronize
+ * with the rest of the pack.
+ */
+ShmemControl *ShmemCtrl = NULL;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -769,6 +825,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	{
 		Size total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
 
+		shmem_resizable = true;
 		reserved_offset += total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
 	}
 
@@ -964,6 +1021,315 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * Resize all shared memory segments based on the current NBuffers value, which
+ * is applied from NBuffersPending. The actual segment resizing is done via
+ * mremap, which will fail if there is not sufficient space to expand the
+ * mapping. When finished, initialize any new buffer blocks based on the new
+ * and old values.
+ *
+ * If any remapping took place, as the last step this function reinitializes
+ * the buffers as well and broadcasts the new value of NSharedBuffers. All of
+ * that needs to be done by only one backend, the first one that manages to
+ * grab the ShmemResizeLock.
+ */
+bool
+AnonymousShmemResize(void)
+{
+	int	numSemas;
+	bool reinit = false;
+	void *ptr = MAP_FAILED;
+	NBuffers = NBuffersPending;
+
+	elog(DEBUG1, "Resize shmem from %d to %d", NBuffersOld, NBuffers);
+
+	/*
+	 * XXX: Where to reset the flag is still an open question. E.g. do we
+	 * consider a no-op when NBuffers is equal to NBuffersOld a genuine resize
+	 * and reset the flag?
+	 */
+	pending_pm_shmem_resize = false;
+
+	/*
+	 * XXX: Currently only increasing of shared_buffers is supported. For
+	 * decreasing something similar has to be done, but buffer blocks with
+	 * data have to be drained first.
+	 */
+	if(NBuffersOld > NBuffers)
+		return false;
+
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		/* Note that CalculateShmemSize indirectly depends on NBuffers */
+		Size new_size = CalculateShmemSize(&numSemas, i);
+		AnonymousMapping *m = &Mappings[i];
+
+		if (m->shmem == NULL)
+			continue;
+
+		if (m->shmem_size == new_size)
+			continue;
+
+		/* Clean up some reserved space to resize into */
+		if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not unmap %zu from reserved shared memory %p: %m",
+							new_size - m->shmem_size, m->shmem)));
+
+		/* Claim the unused space */
+		elog(DEBUG1, "segment[%s]: remap from %zu to %zu at address %p",
+					 MappingName(m->shmem_segment), m->shmem_size,
+					 new_size, m->shmem);
+
+		ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
+		if (ptr == MAP_FAILED)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not resize shared memory segment %s [%p] to %d (%zu): %m",
+							MappingName(m->shmem_segment), m->shmem, NBuffers,
+							new_size)));
+
+		reinit = true;
+		m->shmem_size = new_size;
+	}
+
+	if (reinit)
+	{
+		if(IsUnderPostmaster &&
+			LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+		{
+			/*
+			 * If the new NBuffers was already broadcasted, the buffer pool was
+			 * already initialized before.
+			 *
+			 * Since we're not on a hot path, we use lwlocks and do not need to
+			 * involve memory barrier.
+			 */
+			if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+			{
+				/*
+				 * Allow the first backend that managed to get the lock to
+				 * reinitialize the new portion of buffer pool. Every other
+				 * process will wait on the shared barrier for that to finish,
+				 * since it's a part of the SHMEM_RESIZE_DONE phase.
+				 *
+				 * Note that it's enough when only one backend will do that,
+				 * even the ShmemInitStruct part. The reason is that resized
+				 * shared memory will maintain the same addresses, meaning that
+				 * all the pointers are still valid, and we only need to update
+				 * structures size in the ShmemIndex once -- any other backend
+				 * will pick up this shared structure from the index.
+				 *
+				 * XXX: This is the right place for buffer eviction as well.
+				 */
+				BufferManagerShmemInit(NBuffersOld);
+
+				/* If all fine, broadcast the new value */
+				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+			}
+
+			LWLockRelease(ShmemResizeLock);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * We are asked to resize shared memory. Wait for all ProcSignal participants
+ * to join the barrier, then do the resize and wait on the barrier until all
+ * participants finish resizing as well -- otherwise we risk inconsistency
+ * between backends.
+ *
+ * XXX: If a backend is blocked on ReadCommand in PostgresMain, it will not
+ * proceed with AnonymousShmemResize after receiving SIGHUP until something
+ * is sent.
+ */
+bool
+ProcessBarrierShmemResize(Barrier *barrier)
+{
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+
+	/* Wait until we have seen the new NBuffers value */
+	if (!pending_pm_shmem_resize)
+		return false;
+
+	/*
+	 * First thing to do after attaching to the barrier is to wait for others.
+	 * We can't simply use BarrierArriveAndWait, because backends might arrive
+	 * here in disjoint groups, e.g. first two backends, pause, then second two
+	 * backends. If the resize is quick enough that can lead to a situation
+	 * where the first group has already finished before the second has appeared,
+	 * and the barrier will only synchronize within those groups.
+	 */
+	if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
+		WaitForProcSignalBarrierReceived(
+				pg_atomic_read_u64(&ShmemCtrl->Generation));
+
+	/*
+	 * Now start the procedure, and elect one backend to ping postmaster to do
+	 * the same.
+	 *
+	 * XXX: If we need to be able to abort resizing, this has to be done later,
+	 * after the SHMEM_RESIZE_DONE.
+	 */
+	if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+	{
+		Assert(IsUnderPostmaster);
+		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
+	}
+
+	AnonymousShmemResize();
+
+	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
+	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
+
+	BarrierDetach(barrier);
+	return true;
+}
+
+/*
+ * GUC assign hook for shared_buffers. It's recommended for an assign hook to
+ * be as minimal as possible, thus we just request shared memory resize and
+ * remember the previous value.
+ */
+void
+assign_shared_buffers(int newval, void *extra, bool *pending)
+{
+	elog(DEBUG1, "Received SIGHUP for shmem resizing");
+
+	/* Request shared memory resize only when it was initialized */
+	if (next_free_segment != 0)
+	{
+		elog(DEBUG1, "Set pending signal");
+		pending_pm_shmem_resize = true;
+		*pending = true;
+		NBuffersPending = newval;
+	}
+
+	NBuffersOld = NBuffers;
+}
+
+/*
+ * Test if we have somehow missed a shmem resize signal and NBuffers value
+ * differs from NSharedBuffers. If yes, catchup and do resize.
+ */
+void
+AdjustShmemSize(void)
+{
+	uint32 NSharedBuffers = pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);
+
+	if (NSharedBuffers != NBuffers)
+	{
+		/*
+		 * If the broadcasted shared_buffers is different from the one we see,
+		 * it could be that the backend has missed a resize signal. To avoid
+		 * any inconsistency, adjust the shared mappings, before having a
+		 * chance to access the buffer pool.
+		 */
+		ereport(LOG,
+				(errmsg("shared_buffers has been changed from %d to %d, "
+						"resize shared memory",
+						NBuffers, NSharedBuffers)));
+		NBuffers = NSharedBuffers;
+		AnonymousShmemResize();
+	}
+}
+
+/*
+ * Start resizing procedure, making sure all existing processes will have
+ * consistent view of shared memory size. Must be called only in postmaster.
+ */
+void
+CoordinateShmemResize(void)
+{
+	elog(DEBUG1, "Coordinating shmem resize from %d to %d",
+		 NBuffersOld, NBuffers);
+	Assert(!IsUnderPostmaster);
+
+	/*
+	 * We use dynamic barrier to help dealing with backends that were spawned
+	 * during the resize.
+	 */
+	BarrierInit(&ShmemCtrl->Barrier, 0);
+
+	/*
+	 * If the value did not change, or shared memory segments are not
+	 * initialized yet, skip the resize.
+	 */
+	if (NBuffersPending == NBuffersOld || next_free_segment == 0)
+	{
+		elog(DEBUG1, "Skip resizing, new %d, old %d, free segment %d",
+			 NBuffers, NBuffersOld, next_free_segment);
+		return;
+	}
+
+	/*
+	 * Shared memory resize requires some coordination done by postmaster,
+	 * and consists of three phases:
+	 *
+	 * - Before the resize all existing backends have the same old NBuffers.
+	 * - When resize is in progress, backends are expected to have a
+	 *   mixture of old a new values. They're not allowed to touch buffer
+	 *   pool during this time frame.
+	 * - After resize has been finished, all existing backends, that can access
+	 *   the buffer pool, are expected to have the same new value of NBuffers.
+	 *
+	 * Those phases are ensured by joining the shared barrier associated with
+	 * the procedure. Since resizing takes time, we need to take into account
+	 * that during that time:
+	 *
+	 * - New backends can be spawned. They will check status of the barrier
+	 *   early during the bootstrap, and wait until everything is over to work
+	 *   with the new NBuffers value.
+	 *
+	 * - Old backends can exit before attempting to resize. Synchronization
+	 *   used between backends relies on the ProcSignalBarrier and waits at
+	 *   the beginning until all participants have received the message,
+	 *   gathering all existing backends.
+	 *
+	 * - Some backends might be blocked and not responding either before or
+	 *   after receiving the message. In the first case such a backend still
+	 *   has a ProcSignalSlot and should be waited for; in the second case the
+	 *   shared barrier will make sure we keep waiting for those backends. In
+	 *   either case there is an unbounded wait.
+	 *
+	 * - Backends might join barrier in disjoint groups with some time in
+	 *   between. That means that relying only on the shared dynamic barrier is
+	 *   not enough -- it will only synchronize the resize procedure within
+	 *   those groups. That's why we first wait for all participants of the
+	 *   ProcSignal mechanism who received the message.
+	 */
+	elog(DEBUG1, "Emit a barrier for shmem resizing");
+	pg_atomic_init_u64(&ShmemCtrl->Generation,
+					   EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE));
+
+	/* To order everything after setting Generation value */
+	pg_memory_barrier();
+
+	/*
+	 * After that postmaster waits for PMSIGNAL_SHMEM_RESIZE as a sign that all
+	 * the rest of the pack has started the procedure and it can resize shared
+	 * memory as well.
+	 *
+	 * Normally we would call WaitForProcSignalBarrier here to wait until every
+	 * backend has reported on the ProcSignalBarrier. But for shared memory
+	 * resize we don't need this, as every participating backend will
+	 * synchronize on the ProcSignal barrier. In fact even if we would like to
+	 * wait here, it wouldn't be possible -- we're in the postmaster, without
+	 * any waiting infrastructure available.
+	 *
+	 * If at some point it will turn out that waiting is essential, we would
+	 * need to consider some alternatives. E.g. it could be a designated
+	 * coordination process, which is not a postmaster. Another option would be
+	 * to introduce a CoordinateShmemResize lock and allow only one process to
+	 * take it (this probably would have to be something different than
+	 * LWLocks, since they block interrupts, and coordination relies on them).
+	 */
+}
+
 /*
  * PGSharedMemoryCreate
  *
@@ -1271,3 +1637,50 @@ PGSharedMemoryDetach(void)
 		}
 	}
 }
+
+void
+WaitOnShmemBarrier(void)
+{
+	Barrier *barrier = &ShmemCtrl->Barrier;
+
+	/* Nothing to do if resizing is not started */
+	if (BarrierPhase(barrier) < SHMEM_RESIZE_START)
+		return;
+
+	BarrierAttach(barrier);
+
+	/* Otherwise wait through all available phases */
+	while (BarrierPhase(barrier) < SHMEM_RESIZE_DONE)
+	{
+		ereport(LOG, (errmsg("ProcSignal barrier is in phase %d, waiting",
+							 BarrierPhase(barrier))));
+
+		BarrierArriveAndWait(barrier, 0);
+	}
+
+	BarrierDetach(barrier);
+}
+
+void
+ShmemControlInit(void)
+{
+	bool foundShmemCtrl;
+
+	ShmemCtrl = (ShmemControl *)
+	ShmemInitStruct("Shmem Control", sizeof(ShmemControl),
+									 &foundShmemCtrl);
+
+	if (!foundShmemCtrl)
+	{
+		/*
+		 * The barrier is missing here, it will be initialized right before
+		 * starting the resizing process as a convenient way to reset it.
+		 */
+
+		/* Initialize with the currently known value */
+		pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+
+		/* shmem_resizable should be initialized by now */
+		ShmemCtrl->Resizable = shmem_resizable;
+	}
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 490f7ce3664..f0cb0098dcd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -426,6 +426,7 @@ static void process_pm_pmsignal(void);
 static void process_pm_child_exit(void);
 static void process_pm_reload_request(void);
 static void process_pm_shutdown_request(void);
+static void process_pm_shmem_resize(void);
 static void dummy_handler(SIGNAL_ARGS);
 static void CleanupBackend(PMChild *bp, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
@@ -1694,6 +1695,9 @@ ServerLoop(void)
 			if (pending_pm_pmsignal)
 				process_pm_pmsignal();
 
+			if (pending_pm_shmem_resize)
+				process_pm_shmem_resize();
+
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
 				ClientSocket s;
@@ -2039,6 +2043,17 @@ process_pm_reload_request(void)
 	}
 }
 
+static void
+process_pm_shmem_resize(void)
+{
+	/*
+	 * Failure to resize is considered to be fatal and will not be
+	 * retried, which means we can disable pending flag right here.
+	 */
+	pending_pm_shmem_resize = false;
+	CoordinateShmemResize();
+}
+
 /*
  * pg_ctl uses SIGTERM, SIGINT and SIGQUIT to request different types of
  * shutdown.
@@ -3852,6 +3867,9 @@ process_pm_pmsignal(void)
 		request_state_update = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_SHMEM_RESIZE))
+		AnonymousShmemResize();
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
 	 */
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index bd68b69ee98..ac844b114bd 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -17,6 +17,7 @@
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -24,7 +25,6 @@ ConditionVariableMinimallyPadded *BufferIOCVArray;
 WritebackContext BackendWritebackContext;
 CkptSortItem *CkptBufferIds;
 
-
 /*
  * Data Structures:
  *		buffers live in a freelist and a lookup data structure.
@@ -62,18 +62,28 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend). Size of data structures initialized
- * here depends on NBuffers, and to be able to change NBuffers without a
- * restart we store each structure into a separate shared memory segment, which
- * could be resized on demand.
+ * postmaster, or in a standalone backend) or during shared-memory resize. Size
+ * of data structures initialized here depends on NBuffers, and to be able to
+ * change NBuffers without a restart we store each structure into a separate
+ * shared memory segment, which could be resized on demand.
+ *
+ * FirstBufferToInit tells where to start initializing buffers. For the
+ * initial setup it will always be zero, but when resizing shared memory it
+ * indicates the number of already initialized buffers.
+ *
+ * No locks are taken in this function; it is the caller's responsibility to
+ * make sure only one backend can work with the new buffers.
  */
 void
-BufferManagerShmemInit(void)
+BufferManagerShmemInit(int FirstBufferToInit)
 {
 	bool		foundBufs,
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
+	int			i;
+	elog(DEBUG1, "BufferManagerShmemInit from %d to %d",
+				 FirstBufferToInit, NBuffers);
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
@@ -110,43 +120,44 @@ BufferManagerShmemInit(void)
 	{
 		/* should find all of these, or none of them */
 		Assert(foundDescs && foundBufs && foundIOCV && foundBufCkpt);
-		/* note: this path is only taken in EXEC_BACKEND case */
-	}
-	else
-	{
-		int			i;
-
 		/*
-		 * Initialize all the buffer headers.
+		 * note: this path is only taken in EXEC_BACKEND case when initializing
+		 * shared memory, or in all cases when resizing shared memory.
 		 */
-		for (i = 0; i < NBuffers; i++)
-		{
-			BufferDesc *buf = GetBufferDescriptor(i);
+	}
 
-			ClearBufferTag(&buf->tag);
+#ifndef EXEC_BACKEND
+	/*
+	 * Initialize all the buffer headers.
+	 */
+	for (i = FirstBufferToInit; i < NBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
 
-			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+		ClearBufferTag(&buf->tag);
 
-			buf->buf_id = i;
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
 
-			pgaio_wref_clear(&buf->io_wref);
+		buf->buf_id = i;
 
-			/*
-			 * Initially link all the buffers together as unused. Subsequent
-			 * management of this list is done by freelist.c.
-			 */
-			buf->freeNext = i + 1;
+		pgaio_wref_clear(&buf->io_wref);
 
-			LWLockInitialize(BufferDescriptorGetContentLock(buf),
-							 LWTRANCHE_BUFFER_CONTENT);
+		/*
+		 * Initially link all the buffers together as unused. Subsequent
+		 * management of this list is done by freelist.c.
+		 */
+		buf->freeNext = i + 1;
 
-			ConditionVariableInit(BufferDescriptorGetIOCV(buf));
-		}
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
 
-		/* Correct last entry of linked list */
-		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
 	}
+#endif
+
+	/* Correct last entry of linked list */
+	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 9d00b80b4f8..abeb91e24fd 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -84,6 +84,9 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ *
+ * XXX: Calculations for non-main shared memory segments are incorrect; they
+ * include more than what is needed for buffers alone.
  */
 Size
 CalculateShmemSize(int *num_semaphores, int shmem_segment)
@@ -151,6 +154,14 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how slots are split, or there was a hidden
+	 * inconsistency in shmem calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
@@ -298,7 +309,7 @@ CreateOrAttachShmemStructs(void)
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
-	BufferManagerShmemInit();
+	BufferManagerShmemInit(0);
 
 	/*
 	 * Set up lock manager
@@ -310,6 +321,11 @@ CreateOrAttachShmemStructs(void)
 	 */
 	PredicateLockShmemInit();
 
+	/*
+	 * Set up shared memory resize manager
+	 */
+	ShmemControlInit();
+
 	/*
 	 * Set up process table
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6bec9be423..d7b56a18b24 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/pg_shmem.h"
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
@@ -113,6 +114,10 @@ static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
 
+#ifdef DEBUG_SHMEM_RESIZE
+bool delay_proc_signal_init = false;
+#endif
+
 /*
  * ProcSignalShmemSize
  *		Compute space needed for ProcSignal's shared memory
@@ -176,6 +181,43 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	uint32		old_pss_pid;
 
 	Assert(cancel_key_len >= 0 && cancel_key_len <= MAX_CANCEL_KEY_LENGTH);
+
+#ifdef DEBUG_SHMEM_RESIZE
+	/*
+	 * Introduced for debugging purposes. You can change the variable at
+	 * runtime using gdb, then start new backends with delayed ProcSignal
+	 * initialization. Simple pg_usleep wont work here due to SIGHUP interrupt
+	 * needed for testing. Taken from pg_sleep;
+	 */
+	if (delay_proc_signal_init)
+	{
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+		float8		endtime = GetNowFloat() + 5;
+
+		for (;;)
+		{
+			float8		delay;
+			long		delay_ms;
+
+			CHECK_FOR_INTERRUPTS();
+
+			delay = endtime - GetNowFloat();
+			if (delay >= 600.0)
+				delay_ms = 600000;
+			else if (delay > 0.0)
+				delay_ms = (long) (delay * 1000.0);
+			else
+				break;
+
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 delay_ms,
+							 WAIT_EVENT_PG_SLEEP);
+			ResetLatch(MyLatch);
+		}
+	}
+#endif
+
 	if (MyProcNumber < 0)
 		elog(ERROR, "MyProcNumber not set");
 	if (MyProcNumber >= NumProcSignalSlots)
@@ -615,6 +657,10 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
+					case PROCSIGNAL_BARRIER_SHMEM_RESIZE:
+						processed = ProcessBarrierShmemResize(
+								&ShmemCtrl->Barrier);
+						break;
 				}
 
 				/*
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 7e1a9b43fae..c07572d6f89 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -498,17 +498,26 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. Verify the structure's size:
+		 * - If it's the same, we've found the expected structure.
+		 * - If it's different, we're resizing the expected structure.
+		 *
+		 * XXX: There is an implicit assumption this can only happen in
+		 * "resizable" segments, where only one shared structure is allowed.
+		 * This has to be implemented more cleanly.
 		 */
 		if (result->size != size)
 		{
-			LWLockRelease(ShmemIndexLock);
-			ereport(ERROR,
-					(errmsg("ShmemIndex entry size is wrong for data structure"
-							" \"%s\": expected %zu, actual %zu",
-							name, size, result->size)));
+			Size delta = size - result->size;
+
+			result->size = size;
+
+			/* Reflect size change in the shared segment */
+			SpinLockAcquire(Segments[shmem_segment].ShmemLock);
+			Segments[shmem_segment].ShmemSegHdr->freeoffset += delta;
+			SpinLockRelease(Segments[shmem_segment].ShmemLock);
 		}
+
 		structPtr = result->location;
 	}
 	else
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0d1b6466d1e..0942d2bffe2 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -62,6 +62,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -4309,6 +4310,15 @@ PostgresMain(const char *dbname, const char *username)
 	 */
 	BeginReportingGUCOptions();
 
+	/* Verify the shared barrier, if it's still active: join and wait. */
+	WaitOnShmemBarrier();
+
+	/*
+	 * After waiting on the barrier above we are guaranteed to have NSharedBuffers
+	 * broadcast, so we can use it in the function below.
+	 */
+	AdjustShmemSize();
+
 	/*
 	 * Also set up handler to log session end; we have to wait till now to be
 	 * sure Log_disconnections has its final value.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4da68312b5f..691fa14e9e3 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -155,6 +155,8 @@ REPLICATION_ORIGIN_DROP	"Waiting for a replication origin to become inactive so
 REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
+SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
+SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
@@ -352,6 +354,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e63521e5a2d..9f00608f508 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2366,14 +2366,14 @@ struct config_int ConfigureNamesInt[] =
 	 * checking for overflow, so we mustn't allow more than INT_MAX / 2.
 	 */
 	{
-		{"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+		{"shared_buffers", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the number of shared memory buffers used by the server."),
 			NULL,
 			GUC_UNIT_BLOCKS
 		},
 		&NBuffers,
 		16384, 16, INT_MAX / 2,
-		NULL, NULL, NULL
+		NULL, assign_shared_buffers, NULL
 	},
 
 	{
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..a0c37a7749e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -173,6 +173,7 @@ extern PGDLLIMPORT char *DataDir;
 extern PGDLLIMPORT int data_directory_mode;
 
 extern PGDLLIMPORT int NBuffers;
+extern PGDLLIMPORT int MaxAvailableMemory;
 extern PGDLLIMPORT int MaxBackends;
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index edac9db6a12..4239ebe640b 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -317,7 +317,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_skipped);
 
 /* in buf_init.c */
-extern void BufferManagerShmemInit(void);
+extern void BufferManagerShmemInit(int);
 extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 6ebda479ced..bb7ae4d33b3 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -64,6 +64,7 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
+extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
@@ -83,5 +84,7 @@ extern void CreateSharedMemoryAndSemaphores(void);
 extern void AttachSharedMemoryStructs(void);
 #endif
 extern void InitializeShmemGUCs(void);
+extern void CoordinateShmemResize(void);
+extern bool AnonymousShmemResize(void);
 
 #endif							/* IPC_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..558da6fdd55 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, ShmemResize)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index f8459a5a421..19ad2e2f788 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,6 +24,7 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "storage/barrier.h"
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
@@ -56,6 +57,25 @@ typedef struct ShmemSegment
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
+/*
+ * ShmemControl is shared between backends and helps to coordinate shared
+ * memory resize.
+ */
+typedef struct
+{
+	pg_atomic_uint32 	NSharedBuffers;
+	Barrier 			Barrier;
+	pg_atomic_uint64 	Generation;
+	bool                Resizable;
+} ShmemControl;
+
+extern PGDLLIMPORT ShmemControl *ShmemCtrl;
+
+/* The phases for shared memory resizing, used by for ProcSignal barrier. */
+#define SHMEM_RESIZE_REQUESTED			0
+#define SHMEM_RESIZE_START				1
+#define SHMEM_RESIZE_DONE				2
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -108,6 +128,12 @@ extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 void *ReserveAnonymousMemory(Size reserve_size);
 
+bool ProcessBarrierShmemResize(Barrier *barrier);
+void assign_shared_buffers(int newval, void *extra, bool *pending);
+void AdjustShmemSize(void);
+extern void WaitOnShmemBarrier(void);
+extern void ShmemControlInit(void);
+
 /*
  * To be able to dynamically resize the largest parts of the data stored in
  * shared memory, we split it into multiple shared memory segments. Each
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 428aa3fd68a..1a55bf57a70 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -42,6 +42,7 @@ typedef enum
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
 	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
+	PMSIGNAL_SHMEM_RESIZE,	/* resize shared memory */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
+#define NUM_PMSIGNALS (PMSIGNAL_SHMEM_RESIZE+1)
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 2733bbb8c5b..97033f84dce 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,6 +54,7 @@ typedef enum
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
+	PROCSIGNAL_BARRIER_SHMEM_RESIZE, /* ask backends to resize shared memory */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a8346cda633..b026a275c38 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2745,6 +2745,7 @@ ShellTypeInfo
 ShippableCacheEntry
 ShippableCacheKey
+ShmemControl
 ShmemIndexEnt
 ShutdownForeignScan_function
 ShutdownInformation
 ShutdownMode
-- 
2.34.1

Attachment: 0007-Use-anonymous-files-to-back-shared-memory-s-20250616.patch (text/x-patch)
From 441f537b64b6bc8f0f00fa0de7850911acff621c Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 15 Mar 2025 16:39:45 +0100
Subject: [PATCH 07/17] Use anonymous files to back shared memory segments

Allow using anonymous files for shared memory instead of plain anonymous
memory. Such an anonymous file is created via memfd_create; it lives in
memory, behaves like a regular file, and is semantically equivalent to
anonymous memory allocated via mmap with MAP_ANONYMOUS.

The advantages of using anon files are the following:

* We get a file descriptor, which can be used for regular file
  operations (modification, truncation, you name it).

* The file can be given a name, which improves readability when it
  comes to process maps. Here is how it looks:

7f90cde00000-7f90d5126000 rw-s 00000000 00:01 5463 /memfd:main (deleted)
7f90d5126000-7f914de00000 ---p 00000000 00:00 0
7f914de00000-7f9175128000 rw-s 00000000 00:01 5466 /memfd:buffers (deleted)
7f9175128000-7f944de00000 ---p 00000000 00:00 0
7f944de00000-7f9455528000 rw-s 00000000 00:01 5469 /memfd:descriptors (deleted)
7f9455528000-7f94cde00000 ---p 00000000 00:00 0
7f94cde00000-7f94d5228000 rw-s 00000000 00:01 5472 /memfd:iocv (deleted)
7f94d5228000-7f954de00000 ---p 00000000 00:00 0
7f954de00000-7f9555266000 rw-s 00000000 00:01 5475 /memfd:checkpoint (deleted)
7f9555266000-7f958de00000 ---p 00000000 00:00 0
7f958de00000-7f95954aa000 rw-s 00000000 00:01 5478 /memfd:strategy (deleted)
7f95954aa000-7f95cde00000 ---p 00000000 00:00 0

* By default, Linux will not add file-backed shared mappings into a core dump,
  making it more convenient to work with them in PostgreSQL: no more huge dumps
  to process.

The downside is that memfd_create is Linux specific.
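
As a minimal standalone illustration of the mechanism (not the patch
itself; the segment name "buffers" and the 1 MB size are arbitrary):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t size = 1024 * 1024;
        /* Shows up in /proc/<pid>/maps as "/memfd:buffers (deleted)". */
        int    fd = memfd_create("buffers", 0);
        void  *p;

        ftruncate(fd, size);    /* give the in-memory file its size */
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        /* Keeping fd open allows resizing later via ftruncate() plus
         * mremap()/mmap(MAP_FIXED) while preserving the contents. */
        munmap(p, size);
        close(fd);
        return 0;
    }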
---
 src/backend/port/sysv_shmem.c  | 73 +++++++++++++++++++++++++++++-----
 src/backend/port/win32_shmem.c |  2 +-
 src/backend/storage/ipc/ipci.c |  2 +-
 src/include/portability/mem.h  |  2 +-
 src/include/storage/pg_shmem.h |  3 +-
 5 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index a3437973784..87000a24eea 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -107,6 +107,7 @@ typedef struct AnonymousMapping
 	Pointer shmem; 				/* Pointer to the start of the mapped memory */
 	Pointer seg_addr; 			/* SysV shared memory for the header */
 	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -127,7 +128,7 @@ static int next_free_segment = 0;
  * 00400000-00490000         /path/bin/postgres
  * ...
  * 012d9000-0133e000         [heap]
- * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f443a800000-7f470a800000 /memfd:main (deleted)
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
@@ -150,9 +151,9 @@ static int next_free_segment = 0;
  * The result would look like this:
  *
  * 012d9000-0133e000         [heap]
- * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f4426f54000-7f442e010000 /memfd:main (deleted)
  * 7f442e010000-7f443a800000                     # reserved empty space
- * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f443a800000-7f444196c000 /memfd:buffers (deleted)
  * 7f444196c000-7f470a800000                     # reserved empty space
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
@@ -643,13 +644,14 @@ PGSharedMemoryAttach(IpcMemoryId shmId,
  * *hugepagesize and *mmap_flags are set to 0.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 #ifdef MAP_HUGETLB
 
 	Size		default_hugepagesize = 0;
 	Size		hugepagesize_local = 0;
 	int			mmap_flags_local = 0;
+	int			memfd_flags_local = 0;
 
 	/*
 	 * System-dependent code to find out the default huge page size.
@@ -708,6 +710,7 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	}
 
 	mmap_flags_local = MAP_HUGETLB;
+	memfd_flags_local = MFD_HUGETLB;
 
 	/*
 	 * On recent enough Linux, also include the explicit page size, if
@@ -718,7 +721,16 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	{
 		int			shift = pg_ceil_log2_64(hugepagesize_local);
 
-		mmap_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+	}
+#endif
+
+#if defined(MFD_HUGE_MASK) && defined(MFD_HUGE_SHIFT)
+	if (hugepagesize_local != default_hugepagesize)
+	{
+		int			shift = pg_ceil_log2_64(hugepagesize_local);
+
+		memfd_flags_local |= (shift & MFD_HUGE_MASK) << MFD_HUGE_SHIFT;
 	}
 #endif
 
@@ -727,6 +739,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*mmap_flags = mmap_flags_local;
 	if (hugepagesize)
 		*hugepagesize = hugepagesize_local;
+	if (memfd_flags)
+		*memfd_flags = memfd_flags_local;
 
 #else
 
@@ -734,6 +748,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*hugepagesize = 0;
 	if (mmap_flags)
 		*mmap_flags = 0;
+	if (memfd_flags)
+		*memfd_flags = 0;
 
 #endif							/* MAP_HUGETLB */
 }
@@ -771,7 +787,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
-	int			mmap_flags = PG_MMAP_FLAGS;
+	int			mmap_flags = PG_MMAP_FLAGS, memfd_flags = 0;
 
 #ifndef MAP_HUGETLB
 	/* ReserveAnonymousMemory should have dealt with this case */
@@ -785,7 +801,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
 
 		/* Round up the request size to a suitable large value */
-		GetHugePageSize(&hugepagesize, &mmap_flags);
+		GetHugePageSize(&hugepagesize, &mmap_flags, &memfd_flags);
 
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
@@ -794,6 +810,29 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	}
 #endif
 
+	/*
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped,  it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, thus it has
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
+	 */
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_segment),
+									   memfd_flags);
+
+	/*
+	 * Specify the segment file size using allocsize, which contains
+	 * potentially modified size.
+	 */
+	if (ftruncate(mapping->segment_fd, allocsize) == -1)
+		ereport(FATAL,
+				(errcode(ERRCODE_SYSTEM_ERROR),
+				 errmsg("could not truncate anonymous file for \"%s\": %m",
+						MappingName(mapping->shmem_segment))));
+
 	elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p",
 		 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
 
@@ -807,7 +846,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 	 * a restart.
 	 */
 	ptr = mmap(base + reserved_offset, allocsize, PROT_READ | PROT_WRITE,
-			   mmap_flags | MAP_FIXED, -1, 0);
+			   mmap_flags | MAP_FIXED, mapping->segment_fd, 0);
 	mmap_errno = errno;
 
 	if (ptr == MAP_FAILED)
@@ -817,8 +856,15 @@ CreateAnonymousSegment(AnonymousMapping *mapping, Pointer base)
 					 "fallback to the non-resizable allocation",
 			 MappingName(mapping->shmem_segment), allocsize, base + reserved_offset);
 
+		/* Specify the segment file size using allocsize. */
+		if (ftruncate(mapping->segment_fd, allocsize) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not truncate anonymous file for \"%s\": %m",
+							MappingName(mapping->shmem_segment))));
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-						   PG_MMAP_FLAGS, -1, 0);
+						   PG_MMAP_FLAGS, mapping->segment_fd, 0);
 		mmap_errno = errno;
 	}
 	else
@@ -889,7 +935,7 @@ ReserveAnonymousMemory(Size reserve_size)
 		Size		hugepagesize, total_size = 0;
 		int			mmap_flags;
 
-		GetHugePageSize(&hugepagesize, &mmap_flags);
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
 
 		/*
 		 * Figure out how much memory is needed for all segments, keeping in
@@ -1070,6 +1116,13 @@ AnonymousShmemResize(void)
 		if (m->shmem_size == new_size)
 			continue;
 
+		/* Resize the backing anon file. */
+		if (ftruncate(m->segment_fd, new_size) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not truncate anonymous file for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
 		/* Clean up some reserved space to resize into */
 		if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
 			ereport(FATAL,
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index ce719f1b412..ba972106de1 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -627,7 +627,7 @@ pgwin32_ReserveSharedMemoryRegion(HANDLE hChild)
  * use GetLargePageMinimum() instead.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 	if (hugepagesize)
 		*hugepagesize = 0;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index abeb91e24fd..dc2b4becf4a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -396,7 +396,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the number of huge pages required.
 	 */
-	GetHugePageSize(&hp_size, NULL);
+	GetHugePageSize(&hp_size, NULL, NULL);
 	if (hp_size != 0)
 	{
 		Size		hp_required;
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index ef9800732d9..40588ff6968 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 19ad2e2f788..192b637cc65 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -125,7 +125,8 @@ extern PGShmemHeader *PGSharedMemoryCreate(Size size,
 										   PGShmemHeader **shim, Pointer base);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
-extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
+extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
+							int *memfd_flags);
 void *ReserveAnonymousMemory(Size reserve_size);
 
 bool ProcessBarrierShmemResize(Barrier *barrier);
-- 
2.34.1

0010-Reinitialize-StrategyControl-after-resizing-20250616.patch (text/x-patch)
From 2d173c9580b125dbc1248d86fb603a8508501ede Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Tue, 10 Jun 2025 11:00:36 +0530
Subject: [PATCH 10/17] Reinitialize StrategyControl after resizing buffers

... and BgBufferSync and ClockSweepTick adjustments

The commit introduces a separate function StrategyReInitialize() instead
of reusing StrategyInitialize(), since some of the things the latter does
are not required in the former. Here's a list of what
StrategyReInitialize() does and how it differs from
StrategyInitialize().

1. When expanding the buffer pool, add new buffers to the free list.
2. When shrinking buffers, we remove any buffers in the area being
   shrunk from the freelist. While doing so we adjust the first and
   last free buffer pointers in the StrategyControl area. Hence nothing
   more is needed after resizing.
3. A sanity check of the free buffer list is added after resizing.
4. The StrategyControl pointer needn't be fetched again since it should
   not change, but an Assert is added to make sure the pointer is valid.
5. &StrategyControl->buffer_strategy_lock need not be initialized again.
6. nextVictimBuffer, completePasses and numBufferAllocs are viewed in
   the context of NBuffers. Now that NBuffers itself has changed, those
   three do not make sense. Reset them as if the server had restarted.

This commit introduces a flag delay_shmem_resize, which PostgreSQL
backends and workers can use to signal the coordinator to delay the
resizing operation. The background writer sets this flag when it is
scanning buffers. The background writer is blocked while the actual
resizing is in progress, but if resizing is about to begin, it skips
scanning the buffers by returning early from BgBufferSync(). It stops a
scan in progress when it sees that the resizing has begun. After the
resizing is finished, it adjusts the collected statistics according to
the new size of the buffer pool at the end of barrier processing.

Once the buffer resizing is finished, before resuming regular
operation, bgwriter resets the information saved so far. This
information is viewed in the context of NBuffers and hence does not make
sense after NBuffers has changed.

Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c         |  24 ++++-
 src/backend/postmaster/bgwriter.c     |   2 +-
 src/backend/storage/buffer/buf_init.c |  11 ++-
 src/backend/storage/buffer/bufmgr.c   |  74 ++++++++++----
 src/backend/storage/buffer/freelist.c | 133 ++++++++++++++++++++++++++
 src/include/storage/buf_internals.h   |   1 +
 src/include/storage/bufmgr.h          |   3 +-
 src/include/storage/ipc.h             |   1 +
 8 files changed, 226 insertions(+), 23 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index c03464b40e2..9144f101585 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -114,6 +114,7 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 
 /* Flag telling postmaster that resize is needed */
 volatile bool pending_pm_shmem_resize = false;
+volatile bool delay_shmem_resize = false;
 
 /* Keeps track of the previous NBuffers value */
 static int NBuffersOld = -1;
@@ -1207,6 +1208,12 @@ AnonymousShmemResize(void)
 
 			LWLockRelease(ShmemResizeLock);
 		}
+
+		/*
+		 * TODO: Shouldn't we call ResizeBufferPool() here as well? Otherwise
+		 * backends that cannot conditionally acquire the LWLock won't resize
+		 * the buffers.
+		 */
 	}
 
 	return true;
@@ -1225,13 +1232,17 @@ AnonymousShmemResize(void)
 bool
 ProcessBarrierShmemResize(Barrier *barrier)
 {
-	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
-		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize, delay_shmem_resize);
 
 	/* Wait until we have seen the new NBuffers value */
 	if (!pending_pm_shmem_resize)
 		return false;
 
+	/* Wait till this process becomes ready to resize buffers. */
+	if (delay_shmem_resize)
+		return false;
+
 	/*
 	 * First thing to do after attaching to the barrier is to wait for others.
 	 * We can't simply use BarrierArriveAndWait, because backends might arrive
@@ -1281,6 +1292,15 @@ ProcessBarrierShmemResize(Barrier *barrier)
 	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
 	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
 
+	if (MyBackendType == B_BG_WRITER)
+	{
+		/*
+		 * Before resuming regular background writer activity, adjust the
+		 * statistics collected so far.
+		 */
+		BgBufferSyncAdjust(NBuffersOld, NBuffers);
+	}
+
 	BarrierDetach(barrier);
 	return true;
 }
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 72f5acceec7..32b34f28ead 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -233,7 +233,7 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
 		/*
 		 * Do one cycle of dirty-buffer writing.
 		 */
-		can_hibernate = BgBufferSync(&wb_context);
+		can_hibernate = BgBufferSync(&wb_context, false);
 
 		/* Report pending statistics to the cumulative stats system */
 		pgstat_report_bgwriter();
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index f78be4700df..7b8bc577bd5 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -163,8 +163,15 @@ BufferManagerShmemInit(int FirstBufferToInit)
 	if (FirstBufferToInit < NBuffers)
 		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
-	/* Init other shared buffer-management stuff */
-	StrategyInitialize(!foundDescs);
+	/*
+	 * Init other shared buffer-management stuff from scratch configuring buffer
+	 * pool the first time. If we are just resizing buffer pool adjust only the
+	 * required structures.
+	 */
+	if (FirstBufferToInit == 0)
+		StrategyInitialize(!foundDescs);
+	else
+		StrategyReInitialize(FirstBufferToInit);
 
 	/* Initialize per-backend file flush context */
 	WritebackContextInit(&BackendWritebackContext,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 57d78c482bb..194d5da2999 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3611,6 +3611,32 @@ BufferSync(int flags)
 	TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
 }
 
+/*
+ * Information saved between BgBufferSync() calls so we can determine the
+ * strategy point's advance rate and avoid scanning already-cleaned buffers.
+ * The variables are at file scope instead of function-local statics so that
+ * BgBufferSyncAdjust() can adjust them when resizing shared buffers.
+ */
+static bool saved_info_valid = false;
+static int	prev_strategy_buf_id;
+static uint32 prev_strategy_passes;
+static int	next_to_clean;
+static uint32 next_passes;
+
+/* Moving averages of allocation rate and clean-buffer density */
+static float smoothed_alloc = 0;
+static float smoothed_density = 10.0;
+
+void
+BgBufferSyncAdjust(int NBuffersOld, int NBuffersNew)
+{
+	saved_info_valid = false;
+#ifdef BGW_DEBUG
+	elog(DEBUG2, "invalidated background writer status after resizing buffers from %d to %d",
+		 NBuffersOld, NBuffersNew);
+#endif
+}
+
 /*
  * BgBufferSync -- Write out some dirty buffers in the pool.
  *
@@ -3623,27 +3649,13 @@ BufferSync(int flags)
  * bgwriter_lru_maxpages to 0.)
  */
 bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(WritebackContext *wb_context, bool reset)
 {
 	/* info obtained from freelist.c */
 	int			strategy_buf_id;
 	uint32		strategy_passes;
 	uint32		recent_alloc;
 
-	/*
-	 * Information saved between calls so we can determine the strategy
-	 * point's advance rate and avoid scanning already-cleaned buffers.
-	 */
-	static bool saved_info_valid = false;
-	static int	prev_strategy_buf_id;
-	static uint32 prev_strategy_passes;
-	static int	next_to_clean;
-	static uint32 next_passes;
-
-	/* Moving averages of allocation rate and clean-buffer density */
-	static float smoothed_alloc = 0;
-	static float smoothed_density = 10.0;
-
 	/* Potentially these could be tunables, but for now, not */
 	float		smoothing_samples = 16;
 	float		scan_whole_pool_milliseconds = 120000.0;
@@ -3666,6 +3678,22 @@ BgBufferSync(WritebackContext *wb_context)
 	long		new_strategy_delta;
 	uint32		new_recent_alloc;
 
+	/*
+	 * If the buffer pool is being shrunk, the buffer being written out may not
+	 * remain valid. If the buffer pool is being expanded, more buffers will
+	 * become available without this function writing out any. Hence wait till
+	 * buffer resizing finishes, i.e. go into hibernation mode.
+	 */
+	if (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+		return true;
+
+	/*
+	 * Resizing shared buffers while this function is performing an LRU scan on
+	 * them may lead to wrong results. Indicate that the resizing should wait for
+	 * the LRU scan to complete.
+	 */
+	delay_shmem_resize = true;
+
 	/*
 	 * Find out where the freelist clock sweep currently is, and how many
 	 * buffer allocations have happened since our last call.
@@ -3842,8 +3870,17 @@ BgBufferSync(WritebackContext *wb_context)
 	num_written = 0;
 	reusable_buffers = reusable_buffers_est;
 
-	/* Execute the LRU scan */
-	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
+	/*
+	 * Execute the LRU scan.
+	 *
+	 * If the buffer pool is being shrunk, the buffer being written may not
+	 * remain valid. If the buffer pool is being expanded, more buffers will
+	 * become available without this function writing any. Hence stop what we
+	 * are doing. This also unblocks other processes that are waiting for
+	 * buffer resizing to finish.
+	 */
+	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est &&
+			pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) == NBuffers)
 	{
 		int			sync_state = SyncOneBuffer(next_to_clean, true,
 											   wb_context);
@@ -3902,6 +3939,9 @@ BgBufferSync(WritebackContext *wb_context)
 #endif
 	}
 
+	/* Let the resizing commence. */
+	delay_shmem_resize = false;
+
 	/* Return true if OK to hibernate */
 	return (bufs_to_lap == 0 && recent_alloc == 0);
 }
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7b9ed010e2f..41641bb3ae6 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -98,6 +98,9 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
 									 uint32 *buf_state);
 static void AddBufferToRing(BufferAccessStrategy strategy,
 							BufferDesc *buf);
+#ifdef USE_ASSERT_CHECKING
+static void StrategyValidateFreeList(void);
+#endif /* USE_ASSERT_CHECKING */
 
 /*
  * ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -526,6 +529,88 @@ StrategyInitialize(bool init)
 		Assert(!init);
 }
 
+/*
+ * StrategyReInitialize -- re-initialize the buffer cache replacement
+ *		strategy.
+ *
+ * To be called when resizing buffer manager and only from the coordinator.
+ * TODO: Assess the differences between this function and StrategyInitialize().
+ */
+void
+StrategyReInitialize(int FirstBufferIdToInit)
+{
+	bool		found;
+
+	/*
+	 * Resizing memory for buffer pools should not affect the address of
+	 * StrategyControl.
+	 */
+	if (StrategyControl != (BufferStrategyControl *)
+		ShmemInitStructInSegment("Buffer Strategy Status",
+						sizeof(BufferStrategyControl),
+						&found, STRATEGY_SHMEM_SEGMENT))
+		elog(FATAL, "something went wrong while re-initializing the buffer strategy");
+
+	Assert(found);
+
+	/* TODO: Buffer lookup table adjustment. There are two options:
+	 *
+	 * 1. Resize the buffer lookup table to match the new number of buffers.
+	 * But this requires rehashing all the entries in the buffer lookup table
+	 * with the new table size.
+	 *
+	 * 2. Allocate the maximum size of the buffer lookup table at the beginning
+	 * and never resize it. This leaves a sparse buffer lookup table, which is
+	 * inefficient from both a memory and a time perspective. According to
+	 * David Rowley, the sparse entries in the buffer lookup table cause
+	 * frequent cacheline reloads, which affects performance. If the impact of
+	 * that inefficiency in a benchmark is significant, we will need to
+	 * consider the first option.
+	 */
+
+	/*
+	 * When shrinking buffers, we must have adjusted the first and the last free
+	 * buffer when removing the buffers being shrunk from the free list. Nothing
+	 * to be done here.
+	 *
+	 * When expanding the shared buffers, new buffers are added at the end of the
+	 * freelist or they form the new free list if there are no free buffers.
+	 */
+	if (FirstBufferIdToInit < NBuffers)
+	{
+		if (StrategyControl->firstFreeBuffer == FREENEXT_END_OF_LIST)
+			StrategyControl->firstFreeBuffer = FirstBufferIdToInit;
+		else
+		{
+			Assert(StrategyControl->lastFreeBuffer >= 0);
+			GetBufferDescriptor(StrategyControl->lastFreeBuffer - 1)->freeNext = FirstBufferIdToInit;
+		}
+
+		StrategyControl->lastFreeBuffer = NBuffers - 1;
+	}
+
+	/* Check free list sanity after resizing. */
+#ifdef USE_ASSERT_CHECKING
+	StrategyValidateFreeList();
+#endif /* USE_ASSERT_CHECKING */
+
+	/*
+	 * The clock sweep tick pointer might have got invalidated. Reset it as if
+	 * starting a fresh server.
+	 */
+	pg_atomic_write_u32(&StrategyControl->nextVictimBuffer, 0);
+
+	/*
+	 * The old statistics are viewed in the context of the number of shared
+	 * buffers. They do not make sense now that the number of shared buffers
+	 * itself has changed.
+	 */
+	StrategyControl->completePasses = 0;
+	pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
+
+	/* No pending notification */
+	StrategyControl->bgwprocno = -1;
+}
 
 /*
  * StrategyPurgeFreeList -- remove all buffers with id higher than the number of
@@ -595,6 +680,54 @@ StrategyPurgeFreeList(int numBuffers)
 	 */
 }
 
+#ifdef USE_ASSERT_CHECKING
+/*
+ * StrategyValidateFreeList -- check sanity of the free buffer list.
+ */
+static void
+StrategyValidateFreeList(void)
+{
+	int			nextFree = StrategyControl->firstFreeBuffer;
+	int			numFreeBuffers = 0;
+	int			lastFreeBuffer = FREENEXT_END_OF_LIST;
+
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+	while (nextFree != FREENEXT_END_OF_LIST)
+	{
+		BufferDesc *buf = GetBufferDescriptor(nextFree);
+
+		/* nextFree should be id of buffer being examined. */
+		Assert(nextFree == buf->buf_id);
+		Assert(buf->buf_id < NBuffers);
+		/* The buffer should not be marked as not in the list. */
+		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+		/* Update our knowledge of last buffer in the free list. */
+		lastFreeBuffer = buf->buf_id;
+
+		numFreeBuffers++;
+
+		/* Avoid an infinite loop in case there are cycles in the free list. */
+		if (numFreeBuffers > NBuffers)
+			break;
+
+		nextFree = buf->freeNext;
+	}
+
+	Assert(numFreeBuffers <= NBuffers);
+
+	/*
+	 * Make sure that the StrategyControl's knowledge of last free buffer
+	 * agrees with what's there in the free list.
+	 */
+	if (StrategyControl->firstFreeBuffer != FREENEXT_END_OF_LIST)
+		Assert(StrategyControl->lastFreeBuffer == lastFreeBuffer);
+
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+}
+#endif /* USE_ASSERT_CHECKING */
+
 /* ----------------------------------------------------------------
  *				Backend-private buffer ring management
  * ----------------------------------------------------------------
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index add15e3723b..46949e9d90e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -454,6 +454,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
 extern void StrategyPurgeFreeList(int numBuffers);
+extern void StrategyReInitialize(int FirstBufferToInit);
 extern bool have_free_buffer(void);
 
 /* buf_table.c */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 0c554f0b130..83a75eab844 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -298,7 +298,8 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+extern bool BgBufferSync(struct WritebackContext *wb_context, bool reset);
+extern void BgBufferSyncAdjust(int NBuffersOld, int NBuffersNew);
 
 extern uint32 GetPinLimit(void);
 extern uint32 GetLocalPinLimit(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index bb7ae4d33b3..7d1c64a9267 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -65,6 +65,7 @@ typedef void (*shmem_startup_hook_type) (void);
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
 extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
+extern PGDLLIMPORT volatile bool delay_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
-- 
2.34.1

0009-Support-shrinking-shared-buffers-20250616.patch (text/x-patch)
From 23c8b2f5d75f52f14c80015ed104e82f69eca5c6 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 9 Jun 2025 14:40:34 +0530
Subject: [PATCH 09/17] Support shrinking shared buffers

When shrinking the shared buffer pool, each buffer in the area being
shrunk needs to be flushed if it's dirty, so as not to lose the changes
to that buffer after shrinking. Also, each such buffer needs to be
removed from the buffer mapping table so that backends do not access it
after shrinking.

Buffer eviction requires a separate barrier phase for two reasons:

1. No other backend should map a new page to any of the buffers being
   evicted while eviction is in progress. So they wait while eviction is
   in progress.

2. Since a pinned buffer has the pin recorded in the backend's local
   memory as well as in the buffer descriptor (which is in shared memory),
   eviction should not coincide with remapping the shared memory of a
   backend. Otherwise we might lose consistency between the local and
   shared pinning records. Hence it needs to be carried out in
   ProcessBarrierShmemResize() and not in AnonymousShmemResize(), as
   indicated by the now-removed comment.

If a buffer being evicted is pinned, we raise a FATAL error, but this should
be improved. There are multiple options: 1. wait for the pinned buffer to get
unpinned, 2. kill the backend or have it cancel the query itself, or 3. roll
back the operation. Note that options 1 and 2 would require accessing the
pinning-related local and shared records, but we lack the infrastructure to
do either of those right now.

Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 35 +++++---
 src/backend/storage/buffer/buf_init.c         |  8 +-
 src/backend/storage/buffer/bufmgr.c           | 89 +++++++++++++++++++
 src/backend/storage/buffer/freelist.c         | 68 ++++++++++++++
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/storage/buf_internals.h           |  1 +
 src/include/storage/bufmgr.h                  |  1 +
 src/include/storage/pg_shmem.h                |  1 +
 8 files changed, 192 insertions(+), 12 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index f0b53ce1d7c..c03464b40e2 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1096,14 +1096,6 @@ AnonymousShmemResize(void)
 	 */
 	pending_pm_shmem_resize = false;
 
-	/*
-	 * XXX: Currently only increasing of shared_buffers is supported. For
-	 * decreasing something similar has to be done, but buffer blocks with
-	 * data have to be drained first.
-	 */
-	if(NBuffersOld > NBuffers)
-		return false;
-
 	for(int i = 0; i < next_free_segment; i++)
 	{
 		/* Note that CalculateShmemSize indirectly depends on NBuffers */
@@ -1201,11 +1193,14 @@ AnonymousShmemResize(void)
 				 * all the pointers are still valid, and we only need to update
 				 * structures size in the ShmemIndex once -- any other backend
 				 * will pick up this shared structure from the index.
-				 *
-				 * XXX: This is the right place for buffer eviction as well.
 				 */
 				BufferManagerShmemInit(NBuffersOld);
 
+				/*
+				 * Wipe out the evictor PID so that it can be used for the next
+				 * buffer resizing operation.
+				*/
+				ShmemCtrl->evictor_pid = 0;
 				/* If all fine, broadcast the new value */
 				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
 			}
@@ -1262,6 +1257,25 @@ ProcessBarrierShmemResize(Barrier *barrier)
 		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
 	}
 
+	/*
+	 * Evict extra buffers when shrinking shared buffers. We need to do this
+	 * while the memory for extra buffers is still mapped i.e. before remapping
+	 * the shared memory segments to a smaller memory area.
+	 */
+	if (NBuffersOld > NBuffersPending)
+	{
+		/*
+		 * TODO: If the buffer eviction fails for any reason, we should
+		 * gracefully rollback the shared buffer resizing and try again. But the
+		 * infrastructure to do so is not available right now. Hence just raise
+		 * a FATAL so that the system restarts.
+		 */
+		if (!EvictExtraBuffers(NBuffersPending, NBuffersOld))
+			elog(FATAL, "buffer eviction failed");
+
+		BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_EVICT);
+	}
+
 	AnonymousShmemResize();
 
 	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
@@ -1763,5 +1777,6 @@ ShmemControlInit(void)
 
 		/* shmem_resizable should be initialized by now */
 		ShmemCtrl->Resizable = shmem_resizable;
+		ShmemCtrl->evictor_pid = 0;
 	}
 }
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ac844b114bd..f78be4700df 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -156,8 +156,12 @@ BufferManagerShmemInit(int FirstBufferToInit)
 	}
 #endif
 
-	/* Correct last entry of linked list */
-	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+	/*
+	 * Correct last entry of linked list, when initializing the buffers or when
+	 * expanding the buffers.
+	 */
+	if (FirstBufferToInit < NBuffers)
+		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 667aa0c0c78..57d78c482bb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/read_stream.h"
 #include "storage/smgr.h"
@@ -7453,3 +7454,91 @@ const PgAioHandleCallbacks aio_local_buffer_readv_cb = {
 	.complete_local = local_buffer_readv_complete,
 	.report = buffer_readv_report,
 };
+
+/*
+ * When shrinking shared buffers pool, evict the buffers which will not be part
+ * of the shrunk buffer pool.
+ */
+bool
+EvictExtraBuffers(int newBufSize, int oldBufSize)
+{
+	bool result = true;
+
+	/*
+	 * If the buffer being evicted is locked, this function will need to wait.
+	 * It should not be called from the postmaster, since it cannot wait on a lock.
+	 */
+	Assert(IsUnderPostmaster);
+
+	/*
+	 * Let only one backend perform eviction. We could split the work across all
+	 * the backends but that doesn't seem necessary.
+	 *
+	 * The first backend to acquire ShmemResizeLock, sets its own PID as the
+	 * evictor PID for other backends to know that the eviction is in progress or
+	 * has already been performed. The evictor backend releases the lock when it
+	 * finishes eviction.  While the eviction is in progress, backends other than
+	 * evictor backend won't be able to take the lock. They won't perform
+	 * eviction. A backend may acquire the lock after eviction has completed, but
+	 * it will not perform eviction since the evictor PID is already set. Evictor
+	 * PID is reset only when the buffer resizing finishes. Thus only one backend
+	 * will perform eviction in a given instance of shared buffers resizing.
+	 *
+	 * Any backend which acquires this lock will release it before the eviction
+	 * phase finishes, hence the same lock can be reused for the next phase of
+	 * resizing buffers.
+	 */
+	if (LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+	{
+		if (ShmemCtrl->evictor_pid == 0)
+		{
+			ShmemCtrl->evictor_pid = MyProcPid;
+
+			StrategyPurgeFreeList(newBufSize);
+
+			/*
+			 * TODO: Before evicting any buffer, we should check whether any of the
+			 * buffers are pinned. If we find that a buffer is pinned after evicting
+			 * most of them, that will impact performance since all those evicted
+			 * buffers might need to be read again.
+			 */
+			for (Buffer buf = newBufSize + 1; buf <= oldBufSize; buf++)
+			{
+				BufferDesc *desc = GetBufferDescriptor(buf - 1);
+				uint32		buf_state;
+				bool		buffer_flushed;
+
+				buf_state = pg_atomic_read_u32(&desc->state);
+
+				/*
+				 * Nobody is expected to touch the buffers while resizing is
+				 * going on, hence an unlocked precheck should be safe and
+				 * saves some cycles.
+				 */
+				if (!(buf_state & BM_VALID))
+					continue;
+
+				ResourceOwnerEnlarge(CurrentResourceOwner);
+				ReservePrivateRefCountEntry();
+
+				LockBufHdr(desc);
+
+				/*
+				 * Now that we have locked buffer descriptor, make sure that the
+				 * buffer without valid data has been skipped above.
+				 */
+				Assert(buf_state & BM_VALID);
+
+				if (!EvictUnpinnedBufferInternal(desc, &buffer_flushed))
+				{
+					elog(WARNING, "could not remove buffer %u, it is pinned", buf);
+					result = false;
+					break;
+				}
+			}
+		}
+		LWLockRelease(ShmemResizeLock);
+	}
+
+	return result;
+}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index bd390f2709d..7b9ed010e2f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -527,6 +527,74 @@ StrategyInitialize(bool init)
 }
 
 
+/*
+ * StrategyPurgeFreeList -- remove all buffers with id higher than the number of
+ * buffers in the buffer pool.
+ *
+ * This is called before evicting buffers while shrinking shared buffers, so that
+ * the free list does not reference a buffer that will be removed.
+ *
+ * The function is called after resizing has started and thus nobody should be
+ * traversing the free list and also not touching the buffers.
+ */
+void
+StrategyPurgeFreeList(int numBuffers)
+{
+	int	firstBuffer = FREENEXT_END_OF_LIST;
+	int	nextFree = StrategyControl->firstFreeBuffer;
+	BufferDesc *prevValidBuf = NULL;
+
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+	while (nextFree != FREENEXT_END_OF_LIST)
+	{
+		BufferDesc *buf = GetBufferDescriptor(nextFree);
+
+		/* nextFree should be id of buffer being examined. */
+		Assert(nextFree == buf->buf_id);
+		/* The buffer should not be marked as not in the list. */
+		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+		/*
+		 * If the buffer is within the new size of the pool, keep it in the
+		 * free list; otherwise discard it.
+		 */
+		if (buf->buf_id < numBuffers)
+		{
+			if (prevValidBuf != NULL)
+				prevValidBuf->freeNext = buf->buf_id;
+			prevValidBuf = buf;
+
+			/* Save the first free buffer in the list if not already known. */
+			if (firstBuffer == FREENEXT_END_OF_LIST)
+				firstBuffer = nextFree;
+		}
+		/* Examine the next buffer in the free list. */
+		nextFree = buf->freeNext;
+	}
+
+	/* Update the last valid free buffer, if there's any. */
+	if (prevValidBuf != NULL)
+	{
+		StrategyControl->lastFreeBuffer = prevValidBuf->buf_id;
+		prevValidBuf->freeNext = FREENEXT_END_OF_LIST;
+	}
+	else
+		StrategyControl->lastFreeBuffer = FREENEXT_END_OF_LIST;
+
+	/* Update first valid free buffer, if there's any. */
+	StrategyControl->firstFreeBuffer = firstBuffer;
+
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
+	/*
+	 * TODO: following was suggested by AI. Check whether it is required.
+	 * If we removed all buffers from the freelist, reset the clock sweep
+	 * pointer to zero.  This is not strictly necessary, but it seems like a
+	 * good idea to avoid confusion.
+	 */
+}
+
 /* ----------------------------------------------------------------
  *				Backend-private buffer ring management
  * ----------------------------------------------------------------
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 691fa14e9e3..0c588b69a90 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -156,6 +156,7 @@ REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it c
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
 SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_EVICT	"Waiting for other backends to finish the buffer eviction phase."
 SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0dec7d93b3b..add15e3723b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -453,6 +453,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
+extern void StrategyPurgeFreeList(int numBuffers);
 extern bool have_free_buffer(void);
 
 /* buf_table.c */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4239ebe640b..0c554f0b130 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -315,6 +315,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_evicted,
 									int32 *buffers_flushed,
 									int32 *buffers_skipped);
+extern bool EvictExtraBuffers(int fromBuf, int toBuf);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(int);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 192b637cc65..23998f5469d 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -64,6 +64,7 @@ extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 typedef struct
 {
 	pg_atomic_uint32 	NSharedBuffers;
+	pid_t				evictor_pid;
 	Barrier 			Barrier;
 	pg_atomic_uint64 	Generation;
 	bool                Resizable;
-- 
2.34.1

0012-Fix-compilation-failure-in-pg_get_shmem_all-20250616.patch (text/x-patch)
From aa1ef8f2a108764585f42073e2b770689e3e8b3b Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 5 Jun 2025 11:34:12 +0530
Subject: [PATCH 12/17] Fix compilation failure in
 pg_get_shmem_allocations_numa()

The compilation failure is caused by
5cefa489760e34d947dbe67b4a922468b2e43668. The ideal fix would be to compute
the total page count across all the shared memory segments. This commit
just fixes the compilation failure.

Ashutosh Bapat
---
 src/backend/storage/ipc/shmem.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c07572d6f89..b411fbce37e 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -696,7 +696,12 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 	 * this is not very likely, and moreover we have more entries, each of
 	 * them using only fraction of the total pages.
 	 */
-	shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
+	/*
+	 * TODO: We should loop through all the Shm segments, instead of just the
+	 * main segment, to find the total page count.
+	 */
+	shm_total_page_count = (Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize
+							/ os_page_size) + 1;
 	page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
 	pages_status = palloc(sizeof(int) * shm_total_page_count);
 
-- 
2.34.1

0011-Additional-validation-for-buffer-in-the-rin-20250616.patch (text/x-patch)
From ac07b39c724fe5a0d4350b7b716c7690d22d4e82 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Wed, 11 Jun 2025 18:15:06 +0530
Subject: [PATCH 11/17] Additional validation for buffer in the ring

If the buffer pool has been shrunk, the buffers in the strategy ring may
not be valid anymore. Modify GetBufferFromRing to check whether the buffer
is still valid before using it. This makes GetBufferFromRing() a bit more
expensive because of the additional boolean condition, but that may not be
expensive enough to affect query performance. The alternative is more
complex, as explained below.

The strategy object is created in CurrentMemoryContext and is not
available in any global structure, and thus not accessible when processing
buffer resizing barriers. We may modify GetAccessStrategy() to register the
strategy in a global linked list and then arrange to deregister it once
it's no longer in use. Looking at the places which use
GetAccessStrategy(), fixing all of those may be some work.

Ashutosh Bapat
---
 src/backend/storage/buffer/freelist.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 41641bb3ae6..74d070733a4 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -948,12 +948,13 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 		strategy->current = 0;
 
 	/*
-	 * If the slot hasn't been filled yet, tell the caller to allocate a new
-	 * buffer with the normal allocation strategy.  He will then fill this
-	 * slot by calling AddBufferToRing with the new buffer.
+	 * If the slot hasn't been filled yet or the buffer in the slot has been
+	 * invalidated when buffer pool was shrunk, tell the caller to allocate a new
+	 * buffer with the normal allocation strategy.  He will then fill this slot
+	 * by calling AddBufferToRing with the new buffer.
 	 */
 	bufnum = strategy->buffers[strategy->current];
-	if (bufnum == InvalidBuffer)
+	if (bufnum == InvalidBuffer || bufnum > NBuffers)
 		return NULL;
 
 	/*
-- 
2.34.1

0013-Fix-compilation-failure-in-pg_get_shmem_pag-20250616.patch (text/x-patch)
From cc706b4db87b3bd69ee2f4acc223e9306fb10674 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 5 Jun 2025 14:42:53 +0530
Subject: [PATCH 13/17] Fix compilation failure in pg_get_shmem_pagesize()

Fix compilation failure in pg_get_shmem_pagesize() due to incorrect call to
GetHugePageSize(). This is a temporary fix to allow compilation to proceed.

Ashutosh Bapat
---
 src/backend/storage/ipc/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index b411fbce37e..4c2bddfe6ca 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -826,7 +826,7 @@ pg_get_shmem_pagesize(void)
 	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
 
 	if (huge_pages_status == HUGE_PAGES_ON)
-		GetHugePageSize(&os_page_size, NULL);
+		GetHugePageSize(&os_page_size, NULL, NULL);
 
 	return os_page_size;
 }
-- 
2.34.1

#75Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#74)
10 attachment(s)
Re: Changing shared_buffers without restart

Hi,

On Mon, Apr 21, 2025 at 7:47 PM Thomas Munro <thomas.munro@gmail.com>
wrote:

One thing I'm still wondering about is whether you really need all
this multi-phase barrier stuff, or even need to stop other backends
from running at all while doing the resize. I guess that's related to
your remapping scheme, but supposing you find the simple
ftruncate()-only approach to be good, my next question is: why isn't
it enough to wait for all backends to agree to stop allocating new
buffers in the range to be truncated, and then left them continue to
run as normal? As far as they would be concerned, the in-progress
downsize has already happened, though it could be reverted later if
the eviction phase fails. Then the coordinator could start evicting
buffers and truncating the shared memory object, which are
phases/steps, sure, but it's not clear to me why they need other
backends' help.

My intention behind keeping all backends waiting was to have a simple way of
not only preventing them from allocating new buffers from the truncated range,
but also eliminating any chance of them accessing those to-be-truncated
buffers. In the end it's just easier (at least for me) to reason about
correctness of the implementation this way.

On Tue, Jun 10, 2025 at 04:39:58PM +0530, Ashutosh Bapat wrote:

Here's patchset rebased on f85f6ab051b7cf6950247e5fa6072c4130613555

Thanks! I've reworked the series to implement the approach suggested by
Thomas, and applied your patches supporting buffer shrinking on top. I
had to restructure the patch set; here is how it looks right now:

1. Preparation patches

Changes that are needed to support the resizing functionality, but are not
strictly related to it.

* Process config reload in AIO workers. Corrects an omission discussed in [1].

* Introduce pending flag for GUC assign hooks. Allows decoupling a GUC value
change from actually applying it, a sort of "pending" change. The idea is to
let custom logic be triggered in an assign hook, which then takes
responsibility for what happens later and how the new value is going to be
applied. Doesn't do GUC reporting yet.
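
Just to illustrate how the flag is meant to be used, here is a minimal sketch
of an assign hook under the new signature. The hook name assign_shared_buffers
and the NBuffersPending / pending_pm_shmem_resize variables are only
assumptions for the example, not necessarily what the series ends up doing:

#include <stdbool.h>

/* Assumed to exist elsewhere in the backend; declared here only to keep the
 * sketch self-contained. */
extern int NBuffers;
extern int NBuffersPending;
extern volatile bool pending_pm_shmem_resize;

void
assign_shared_buffers(int newval, void *extra, bool *pending)
{
	(void) extra;

	/* Nothing to coordinate if the value is unchanged. */
	if (newval == NBuffers)
		return;

	/*
	 * Remember the requested size, but tell guc.c not to overwrite the
	 * variable yet; the resize machinery applies it later, once all
	 * backends have agreed on the new shared memory size.
	 */
	NBuffersPending = newval;
	pending_pm_shmem_resize = true;
	*pending = true;
}

A hook that leaves *pending untouched behaves exactly as before, so existing
assign hooks only need the extra parameter added to their signature.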

* Introduce pss_barrierReceivedGeneration. Allows distinguishing between
situations when a signal was processed everywhere and when it was received
everywhere.

2. Resizing implementation

* Allow to use multiple shared memory mappings. A preparation patch, extending
the existing interface to support multiple shared memory segments.

* Address space reservation for shared memory. Implements a new way of
handling shared memory segments; now each segment can visually be represented
as follows:

    /              Address space                 \
    +---------------<+>--------------------------+
    | Actual content | Address space reservation |
    | (memfd)        | (mmap, PROT_NONE)         |
    +---------------<+>--------------------------+

The actual segment size is managed via ftruncate and mprotect. One interesting
side effect I haven't fully understood yet is that Linux doesn't seem to
extend the existing mapping when doing mprotect on huge pages; it creates
another mapping instead. E.g. when using the normal page size and resizing
shared memory we get:

7f4808600000-7f4817e00000 rw-s /memfd:buffers (deleted)
7f4817e00000-7f48a2000000 ---s /memfd:buffers (deleted)

Doing the same with huge pages ends up looking like this:

7f4808600000-7f4817e00000 rw-s /memfd:buffers (deleted)
7f4817e00000-7f4830000000 rw-s /memfd:buffers (deleted)
7f4830000000-7f48a2000000 ---s /memfd:buffers (deleted)

I'm still investigating whether it's a mistake on my side or genuine Linux
behavior. At the same time I don't see it as a big issue; the same situation
could happen with the previous implementation as well.
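
To make it easier to experiment with the underlying mechanism outside of
PostgreSQL, here is a minimal standalone sketch (Linux-only, arbitrary sizes,
error handling trimmed) of the reserve-then-commit idea described above: back
the segment with a memfd, map the whole reservation with PROT_NONE, and grow
the usable part with ftruncate plus mprotect. It only illustrates the idea and
is not the patch code:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
	size_t	reserved = 64UL * 1024 * 1024;		/* reserved address space */
	size_t	active = 16UL * 1024 * 1024;		/* initially usable size */
	size_t	new_active = 32UL * 1024 * 1024;	/* size after "resize" */
	int		fd;
	char   *base;

	/* Anonymous in-memory file backing the segment, sized via ftruncate. */
	fd = memfd_create("buffers", 0);
	if (fd < 0 || ftruncate(fd, active) < 0)
		return 1;

	/*
	 * Map the whole reservation from the file, but with PROT_NONE: no page
	 * in this range is accessible until we enable it below.
	 */
	base = mmap(NULL, reserved, PROT_NONE, MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		return 1;

	/* Make the currently active prefix readable and writable. */
	if (mprotect(base, active, PROT_READ | PROT_WRITE) < 0)
		return 1;
	memset(base, 0, active);

	/* "Resize": extend the file, then unlock the additional range in place. */
	if (ftruncate(fd, new_active) < 0 ||
		mprotect(base, new_active, PROT_READ | PROT_WRITE) < 0)
		return 1;
	memset(base + active, 0, new_active - active);

	printf("segment at %p grown from %zu to %zu bytes\n",
		   (void *) base, active, new_active);
	return 0;
}

In the series the equivalent logic lives around CreateAnonymousSegment() and
AnonymousShmemResize(), with the SysV header segment, huge page handling and
proper error reporting on top.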

* Introduce multiple shmem segments for shared buffers. Modifies the necessary
bits to use the new functionality.

* Allow to resize shared memory without restart. Utilizes the infrastructure
introduced so far to implement a stop-the-world resizing approach, where all
the active backends (and potentially newly spawned ones) wait until everyone
gets the same shared memory size.

When testing I've noticed that there seem to be concurrency issues with
interrupts, where AIO workers and the checkpointer sometimes do not receive
the resize signal correctly. I assume it has something to do with the
significant behavior change -- config reload processing can now fire
signals on its own. Letting those backends always process the config
reload first seems to resolve (or at least hide) the issue, but I
still need to understand what's going on there.

3. Shared memory shrinking

So far only shared memory increase was implemented. These patches from Ashutosh
support shrinking as well, which is tricky due to the need for buffer eviction.

* Support shrinking shared buffers
* Reinitialize StrategyControl after resizing buffers
* Additional validation for buffer in the ring

0009 adds support to shrink shared buffers. It has two changes: a.
evict the buffers outside the new buffer size b. remove buffers with
buffer id outside the new buffer size from the free list. If a buffer
being evicted is pinned, the operation is aborted and a FATAL error is
raised. I think we need to change this behaviour to be less severe
like rolling back the operation or waiting for the pinned buffer to be
unpinned etc. Better even if we could let users control the behaviour.
But we need better infrastructure to do such things. That's one TODO
left in the patch.

I haven't reviewed those, just tested them a bit to finally include them into
the series. Note that I had to tweak two things:

* As originally implemented, the resize signal was sent to the postmaster
before doing eviction, which could result in SIGBUS when accessing the LSN of
a dirty buffer to be evicted. I've reshaped it a bit to make sure eviction
always happens first.

* It seems CurrentResourceOwner could be missing sometimes, so I've added
a band-aid checking its presence.

One side note: during my testing I've noticed assert failures in
pgstat_tracks_io_op inside the WAL writer a few times. I couldn't reproduce
them after the fixes above, but they may still indicate that something is off.
E.g. it's somehow not expected that the WAL writer will do buffer eviction IO
(from what I understand, the current shrinking implementation allows that).

Buffer lookup table resizing
------------------------------------
The size of the buffer lookup table depends upon (number of shared
buffers + number of partitions in the shared buffer lookup table). If we
shrink the buffer pool, the buffer lookup table will become sparse but
still useful. If we expand the buffers we need to expand the buffer lookup
table too. That's not implemented in the current patchset.

Just FYI, the buffer lookup table has its own STRATEGY_SHMEM_SEGMENT shared
memory segment and is resized in the same way as the others. There could be
lots of details missing, but at least the corresponding resizable segment is
already there.

[1]: /messages/by-id/sh5uqe4a4aqo5zkkpfy5fobe2rg2zzouctdjz7kou4t74c66ql@yzpkxb7pgoxf

Attachments:

v5-0001-Process-config-reload-in-AIO-workers.patch (text/plain)
From 5d4f46dc13bacf9c1df79233f8e4017a9dcd6919 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 15:14:33 +0200
Subject: [PATCH v5 01/10] Process config reload in AIO workers

Currently AIO workers process interrupts only via CHECK_FOR_INTERRUPTS,
which does not include ConfigReloadPending. Thus we need to check for it
explicitly.
---
 src/backend/storage/aio/method_worker.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 36be179678d..b4d5c46fb94 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -80,6 +80,7 @@ static void pgaio_worker_shmem_init(bool first_time);
 static bool pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh);
 static int	pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
 
+static void pgaio_worker_process_interrupts(void);
 
 const IoMethodOps pgaio_worker_ops = {
 	.shmem_size = pgaio_worker_shmem_size,
@@ -461,6 +462,8 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		int			nwakeups = 0;
 		int			worker;
 
+		pgaio_worker_process_interrupts();
+
 		/*
 		 * Try to get a job to do.
 		 *
@@ -584,3 +587,25 @@ pgaio_workers_enabled(void)
 {
 	return io_method == IOMETHOD_WORKER;
 }
+
+/*
+ * Process any new interrupts.
+ */
+static void
+pgaio_worker_process_interrupts(void)
+{
+	/*
+	 * Reloading the config can trigger further signals, complicating
+	 * interrupt processing -- so let it run first.
+	 *
+	 * XXX: Is there any need for a memory barrier after ProcessConfigFile?
+	 */
+	if (ConfigReloadPending)
+	{
+		ConfigReloadPending = false;
+		ProcessConfigFile(PGC_SIGHUP);
+	}
+
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+}
-- 
2.49.0

v5-0002-Introduce-pending-flag-for-GUC-assign-hooks.patch (text/plain)
From fd29084b3221e1901a1f07656d4e7abb31335caf Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH v5 02/10] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinated work
between backends to change, meaning we cannot apply the new value right
away.

Add a new flag "pending" for an assign hook, to allow the hook to indicate
exactly that. If the pending flag is set after the hook, the new value
will not be applied, and its handling becomes the responsibility of the
hook's implementation.

Note that this also requires changes in the way GUCs are reported, but
the patch does not cover that yet.
---
 src/backend/access/transam/xlog.c    |  2 +-
 src/backend/commands/variable.c      |  6 +--
 src/backend/libpq/pqcomm.c           |  8 ++--
 src/backend/tcop/postgres.c          |  2 +-
 src/backend/utils/misc/guc.c         | 59 +++++++++++++++++++---------
 src/backend/utils/misc/stack_depth.c |  2 +-
 src/include/utils/guc.h              |  2 +-
 src/include/utils/guc_hooks.h        | 20 +++++-----
 8 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 47ffc0a2307..d1be780683b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2321,7 +2321,7 @@ CalculateCheckpointSegments(void)
 }
 
 void
-assign_max_wal_size(int newval, void *extra)
+assign_max_wal_size(int newval, void *extra, bool *pending)
 {
 	max_wal_size_mb = newval;
 	CalculateCheckpointSegments();
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 608f10d9412..e40dae2ddf2 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1143,7 +1143,7 @@ check_cluster_name(char **newval, void **extra, GucSource source)
  * GUC assign_hook for maintenance_io_concurrency
  */
 void
-assign_maintenance_io_concurrency(int newval, void *extra)
+assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
 {
 	/*
 	 * Reconfigure recovery prefetching, because a setting it depends on
@@ -1161,12 +1161,12 @@ assign_maintenance_io_concurrency(int newval, void *extra)
  * they may be assigned in either order.
  */
 void
-assign_io_max_combine_limit(int newval, void *extra)
+assign_io_max_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit = Min(newval, io_combine_limit_guc);
 }
 void
-assign_io_combine_limit(int newval, void *extra)
+assign_io_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit = Min(io_max_combine_limit, newval);
 }
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index e5171467de1..2a6a587ef76 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1952,7 +1952,7 @@ pq_settcpusertimeout(int timeout, Port *port)
  * GUC assign_hook for tcp_keepalives_idle
  */
 void
-assign_tcp_keepalives_idle(int newval, void *extra)
+assign_tcp_keepalives_idle(int newval, void *extra, bool *pending)
 {
 	/*
 	 * The kernel API provides no way to test a value without setting it; and
@@ -1985,7 +1985,7 @@ show_tcp_keepalives_idle(void)
  * GUC assign_hook for tcp_keepalives_interval
  */
 void
-assign_tcp_keepalives_interval(int newval, void *extra)
+assign_tcp_keepalives_interval(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivesinterval(newval, MyProcPort);
@@ -2008,7 +2008,7 @@ show_tcp_keepalives_interval(void)
  * GUC assign_hook for tcp_keepalives_count
  */
 void
-assign_tcp_keepalives_count(int newval, void *extra)
+assign_tcp_keepalives_count(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivescount(newval, MyProcPort);
@@ -2031,7 +2031,7 @@ show_tcp_keepalives_count(void)
  * GUC assign_hook for tcp_user_timeout
  */
 void
-assign_tcp_user_timeout(int newval, void *extra)
+assign_tcp_user_timeout(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_settcpusertimeout(newval, MyProcPort);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2f8c3d5f918..0d1b6466d1e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3591,7 +3591,7 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 
 /* GUC assign hook for transaction_timeout */
 void
-assign_transaction_timeout(int newval, void *extra)
+assign_transaction_timeout(int newval, void *extra, bool *pending)
 {
 	if (IsTransactionState())
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 667df448732..bb681f5bc60 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1679,6 +1679,7 @@ InitializeOneGUCOption(struct config_generic *gconf)
 				struct config_int *conf = (struct config_int *) gconf;
 				int			newval = conf->boot_val;
 				void	   *extra = NULL;
+				bool 	   pending = false;
 
 				Assert(newval >= conf->min);
 				Assert(newval <= conf->max);
@@ -1687,9 +1688,13 @@ InitializeOneGUCOption(struct config_generic *gconf)
 					elog(FATAL, "failed to initialize %s to %d",
 						 conf->gen.name, newval);
 				if (conf->assign_hook)
-					conf->assign_hook(newval, extra);
-				*conf->variable = conf->reset_val = newval;
-				conf->gen.extra = conf->reset_extra = extra;
+					conf->assign_hook(newval, extra, &pending);
+
+				if (!pending)
+				{
+					*conf->variable = conf->reset_val = newval;
+					conf->gen.extra = conf->reset_extra = extra;
+				}
 				break;
 			}
 		case PGC_REAL:
@@ -2041,13 +2046,18 @@ ResetAllOptions(void)
 			case PGC_INT:
 				{
 					struct config_int *conf = (struct config_int *) gconf;
+					bool 			  pending = false;
 
 					if (conf->assign_hook)
 						conf->assign_hook(conf->reset_val,
-										  conf->reset_extra);
-					*conf->variable = conf->reset_val;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									conf->reset_extra);
+										  conf->reset_extra,
+										  &pending);
+					if (!pending)
+					{
+						*conf->variable = conf->reset_val;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										conf->reset_extra);
+					}
 					break;
 				}
 			case PGC_REAL:
@@ -2424,16 +2434,21 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
 							struct config_int *conf = (struct config_int *) gconf;
 							int			newval = newvalue.val.intval;
 							void	   *newextra = newvalue.extra;
+							bool 	    pending = false;
 
 							if (*conf->variable != newval ||
 								conf->gen.extra != newextra)
 							{
 								if (conf->assign_hook)
-									conf->assign_hook(newval, newextra);
-								*conf->variable = newval;
-								set_extra_field(&conf->gen, &conf->gen.extra,
-												newextra);
-								changed = true;
+									conf->assign_hook(newval, newextra, &pending);
+
+								if (!pending)
+								{
+									*conf->variable = newval;
+									set_extra_field(&conf->gen, &conf->gen.extra,
+													newextra);
+									changed = true;
+								}
 							}
 							break;
 						}
@@ -3850,18 +3865,24 @@ set_config_with_handle(const char *name, config_handle *handle,
 
 				if (changeVal)
 				{
+					bool pending = false;
+
 					/* Save old value to support transaction abort */
 					if (!makeDefault)
 						push_old_value(&conf->gen, action);
 
 					if (conf->assign_hook)
-						conf->assign_hook(newval, newextra);
-					*conf->variable = newval;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									newextra);
-					set_guc_source(&conf->gen, source);
-					conf->gen.scontext = context;
-					conf->gen.srole = srole;
+						conf->assign_hook(newval, newextra, &pending);
+
+					if (!pending)
+					{
+						*conf->variable = newval;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										newextra);
+						set_guc_source(&conf->gen, source);
+						conf->gen.scontext = context;
+						conf->gen.srole = srole;
+					}
 				}
 				if (makeDefault)
 				{
diff --git a/src/backend/utils/misc/stack_depth.c b/src/backend/utils/misc/stack_depth.c
index 8f7cf531fbc..ef59ae62008 100644
--- a/src/backend/utils/misc/stack_depth.c
+++ b/src/backend/utils/misc/stack_depth.c
@@ -156,7 +156,7 @@ check_max_stack_depth(int *newval, void **extra, GucSource source)
 
 /* GUC assign hook for max_stack_depth */
 void
-assign_max_stack_depth(int newval, void *extra)
+assign_max_stack_depth(int newval, void *extra, bool *pending)
 {
 	ssize_t		newval_bytes = newval * (ssize_t) 1024;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f619100467d..8802ad8a3cb 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -187,7 +187,7 @@ typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource sourc
 typedef bool (*GucEnumCheckHook) (int *newval, void **extra, GucSource source);
 
 typedef void (*GucBoolAssignHook) (bool newval, void *extra);
-typedef void (*GucIntAssignHook) (int newval, void *extra);
+typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);
 typedef void (*GucRealAssignHook) (double newval, void *extra);
 typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 typedef void (*GucEnumAssignHook) (int newval, void *extra);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 799fa7ace68..c8300cffa8e 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -81,14 +81,14 @@ extern bool check_log_stats(bool *newval, void **extra, GucSource source);
 extern bool check_log_timezone(char **newval, void **extra, GucSource source);
 extern void assign_log_timezone(const char *newval, void *extra);
 extern const char *show_log_timezone(void);
-extern void assign_maintenance_io_concurrency(int newval, void *extra);
-extern void assign_io_max_combine_limit(int newval, void *extra);
-extern void assign_io_combine_limit(int newval, void *extra);
+extern void assign_maintenance_io_concurrency(int newval, void *extra, bool *pending);
+extern void assign_io_max_combine_limit(int newval, void *extra, bool *pending);
+extern void assign_io_combine_limit(int newval, void *extra, bool *pending);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
 										 GucSource source);
-extern void assign_max_wal_size(int newval, void *extra);
+extern void assign_max_wal_size(int newval, void *extra, bool *pending);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
-extern void assign_max_stack_depth(int newval, void *extra);
+extern void assign_max_stack_depth(int newval, void *extra, bool *pending);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
@@ -143,13 +143,13 @@ extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 extern void assign_syslog_facility(int newval, void *extra);
 extern void assign_syslog_ident(const char *newval, void *extra);
-extern void assign_tcp_keepalives_count(int newval, void *extra);
+extern void assign_tcp_keepalives_count(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_count(void);
-extern void assign_tcp_keepalives_idle(int newval, void *extra);
+extern void assign_tcp_keepalives_idle(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_idle(void);
-extern void assign_tcp_keepalives_interval(int newval, void *extra);
+extern void assign_tcp_keepalives_interval(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_interval(void);
-extern void assign_tcp_user_timeout(int newval, void *extra);
+extern void assign_tcp_user_timeout(int newval, void *extra, bool *pending);
 extern const char *show_tcp_user_timeout(void);
 extern bool check_temp_buffers(int *newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra,
@@ -165,7 +165,7 @@ extern bool check_transaction_buffers(int *newval, void **extra, GucSource sourc
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
 extern bool check_transaction_read_only(bool *newval, void **extra, GucSource source);
-extern void assign_transaction_timeout(int newval, void *extra);
+extern void assign_transaction_timeout(int newval, void *extra, bool *pending);
 extern const char *show_unix_socket_permissions(void);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern bool check_wal_consistency_checking(char **newval, void **extra,
-- 
2.49.0

v5-0003-Introduce-pss_barrierReceivedGeneration.patch
From efbe93b30e0174c4fba42047b14208b3fd5c0f43 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH v5 03/10] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier makes it possible to ensure that the
message sent via EmitProcSignalBarrier has been processed by all
ProcSignal mechanism participants.

Add pss_barrierReceivedGeneration alongside pss_barrierGeneration; it is
updated as soon as a process has received the message, before processing
it. This makes it possible to support a new mode of waiting, where
ProcSignal participants want to synchronize message processing. To do
that, a participant can wait via WaitForProcSignalBarrierReceived while
processing a message, effectively making sure that all processes start
processing the ProcSignalBarrier simultaneously.
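
As an illustration, a barrier handler that needs all backends to start
at the same time could look roughly like this. It is only a sketch: the
barrier type is made up, and it assumes the generation observed in
ProcessProcSignalBarrier is passed down to the per-type handler:

    static bool
    ProcessBarrierSomething(uint64 shared_gen)
    {
        /* Block until every backend has received this generation. */
        WaitForProcSignalBarrierReceived(shared_gen);

        /* ... now all backends perform the coordinated work ... */
        return true;
    }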
---
 src/backend/storage/ipc/procsignal.c | 67 ++++++++++++++++++++++------
 src/include/storage/procsignal.h     |  1 +
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index a9bb540b55a..c6bec9be423 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -58,7 +58,10 @@
  * of it. For such use cases, we set a bit in pss_barrierCheckMask and then
  * increment the current "barrier generation"; when the new barrier generation
  * (or greater) appears in the pss_barrierGeneration flag of every process,
- * we know that the message has been received everywhere.
+ * we know that the message has been received and processed everywhere. In
+ * case we only need to know that the message was received everywhere (e.g.
+ * when receiving processes need to handle the message in a coordinated
+ * fashion), use pss_barrierReceivedGeneration in the same way.
  */
 typedef struct
 {
@@ -70,6 +73,7 @@ typedef struct
 
 	/* Barrier-related fields (not protected by pss_mutex) */
 	pg_atomic_uint64 pss_barrierGeneration;
+	pg_atomic_uint64 pss_barrierReceivedGeneration;
 	pg_atomic_uint32 pss_barrierCheckMask;
 	ConditionVariable pss_barrierCV;
 } ProcSignalSlot;
@@ -152,6 +156,8 @@ ProcSignalShmemInit(void)
 			slot->pss_cancel_key_len = 0;
 			MemSet(slot->pss_signalFlags, 0, sizeof(slot->pss_signalFlags));
 			pg_atomic_init_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+			pg_atomic_init_u64(&slot->pss_barrierReceivedGeneration,
+							   PG_UINT64_MAX);
 			pg_atomic_init_u32(&slot->pss_barrierCheckMask, 0);
 			ConditionVariableInit(&slot->pss_barrierCV);
 		}
@@ -199,6 +205,8 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	barrier_generation =
 		pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration,
+						barrier_generation);
 
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
@@ -263,6 +271,7 @@ CleanupProcSignalState(int status, Datum arg)
 	 * no barrier waits block on it.
 	 */
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration, PG_UINT64_MAX);
 
 	SpinLockRelease(&slot->pss_mutex);
 
@@ -416,12 +425,8 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
 	return generation;
 }
 
-/*
- * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
- * requested by a specific call to EmitProcSignalBarrier() have taken effect.
- */
-void
-WaitForProcSignalBarrier(uint64 generation)
+static void
+WaitForProcSignalBarrierInternal(uint64 generation, bool receivedOnly)
 {
 	Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));
 
@@ -436,12 +441,17 @@ WaitForProcSignalBarrier(uint64 generation)
 		uint64		oldval;
 
 		/*
-		 * It's important that we check only pss_barrierGeneration here and
-		 * not pss_barrierCheckMask. Bits in pss_barrierCheckMask get cleared
-		 * before the barrier is actually absorbed, but pss_barrierGeneration
+		 * It's important that we check only pss_barrierGeneration and
+		 * pss_barrierReceivedGeneration here, not pss_barrierCheckMask. Bits in
+		 * pss_barrierCheckMask get cleared before the barrier is actually
+		 * absorbed, but pss_barrierGeneration & pss_barrierReceivedGeneration
 		 * is updated only afterward.
 		 */
-		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+		if (receivedOnly)
+			oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+		else
+			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
 		while (oldval < generation)
 		{
 			if (ConditionVariableTimedSleep(&slot->pss_barrierCV,
@@ -450,7 +460,11 @@ WaitForProcSignalBarrier(uint64 generation)
 				ereport(LOG,
 						(errmsg("still waiting for backend with PID %d to accept ProcSignalBarrier",
 								(int) pg_atomic_read_u32(&slot->pss_pid))));
-			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
+			if (receivedOnly)
+				oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+			else
+				oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		}
 		ConditionVariableCancelSleep();
 	}
@@ -464,12 +478,33 @@ WaitForProcSignalBarrier(uint64 generation)
 	 * The caller is probably calling this function because it wants to read
 	 * the shared state or perform further writes to shared state once all
 	 * backends are known to have absorbed the barrier. However, the read of
-	 * pss_barrierGeneration was performed unlocked; insert a memory barrier
-	 * to separate it from whatever follows.
+	 * pss_barrierGeneration or pss_barrierReceivedGeneration was performed
+	 * unlocked; insert a memory barrier to separate it from whatever follows.
 	 */
 	pg_memory_barrier();
 }
 
+/*
+ * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
+ * requested by a specific call to EmitProcSignalBarrier() have taken effect.
+ */
+void
+WaitForProcSignalBarrier(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, false);
+}
+
+/*
+ * WaitForProcSignalBarrierReceived - wait until it is guaranteed that all
+ * backends have observed the message sent by a specific call to
+ * EmitProcSignalBarrier().
+ */
+void
+WaitForProcSignalBarrierReceived(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, true);
+}
+
 /*
  * Handle receipt of an interrupt indicating a global barrier event.
  *
@@ -523,6 +558,10 @@ ProcessProcSignalBarrier(void)
 	if (local_gen == shared_gen)
 		return;
 
+	/* The message is observed, record that */
+	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
+						shared_gen);
+
 	/*
 	 * Get and clear the flags that are set for this backend. Note that
 	 * pg_atomic_exchange_u32 is a full barrier, so we're guaranteed that the
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index afeeb1ca019..2733bbb8c5b 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -79,6 +79,7 @@ extern void SendCancelRequest(int backendPID, const uint8 *cancel_key, int cance
 
 extern uint64 EmitProcSignalBarrier(ProcSignalBarrierType type);
 extern void WaitForProcSignalBarrier(uint64 generation);
+extern void WaitForProcSignalBarrierReceived(uint64 generation);
 extern void ProcessProcSignalBarrier(void);
 
 extern void procsignal_sigusr1_handler(SIGNAL_ARGS);
-- 
2.49.0

v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch
From a8e77ba00c05765f6d7ed05c8239a6bfecdbce4c Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 28 Feb 2025 19:54:47 +0100
Subject: [PATCH v5 04/10] Allow to use multiple shared memory mappings

Currently all work with shared memory is done via a single anonymous
memory mapping, which limits the ways the shared memory can be organized.

Introduce the possibility to allocate multiple shared memory mappings,
where a single mapping is associated with a specified shared memory
segment. There is only a fixed number of available segments; currently
only one main shared memory segment is allocated. A new shared memory
API is introduced, extended with a segment as a new parameter. As a path
of least resistance, the original API is kept in place and redirects to
the main shared memory segment.
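
For illustration, allocating a structure with the extended API looks
like this (the structure name and size below are placeholders):

    bool    found;
    void   *ptr;

    /* Original API, implicitly backed by the main segment */
    ptr = ShmemInitStruct("My Struct", size, &found);

    /* Extended API, explicit about the target segment */
    ptr = ShmemInitStructInSegment("My Struct", size, &found,
                                   MAIN_SHMEM_SEGMENT);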
---
 src/backend/port/posix_sema.c     |   4 +-
 src/backend/port/sysv_sema.c      |   4 +-
 src/backend/port/sysv_shmem.c     | 138 +++++++++++++++++++---------
 src/backend/port/win32_sema.c     |   2 +-
 src/backend/storage/ipc/ipc.c     |   4 +-
 src/backend/storage/ipc/ipci.c    |  63 +++++++------
 src/backend/storage/ipc/shmem.c   | 148 +++++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c |  13 ++-
 src/include/storage/ipc.h         |   2 +-
 src/include/storage/pg_sema.h     |   2 +-
 src/include/storage/pg_shmem.h    |  18 ++++
 src/include/storage/shmem.h       |  12 +++
 12 files changed, 284 insertions(+), 126 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..401e1113fa1 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 #endif
 
 	numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 423b2b4f9d6..4ce2cfb662b 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -307,7 +307,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -328,7 +328,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 	numSharedSemas = 0;
 	maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..56af0231d24 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+	int shmem_segment;
+	Size shmem_size; 			/* Size of the mapping */
+	Pointer shmem; 				/* Pointer to the start of the mapped memory */
+	Pointer seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping segments */
+static int next_free_segment = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_segment)
+{
+	switch (shmem_segment)
+	{
+		case MAIN_SHMEM_SEGMENT:
+			return "main";
+		default:
+			return "unknown";
+	}
+}
+
+static void
+DebugMappings(void)
+{
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		AnonymousMapping m = Mappings[i];
+		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+			 MappingName(i), m.shmem, m.shmem_size);
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-	Size		allocsize = *size;
+	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
 				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		{
+			DebugMappings();
+			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
+				 MappingName(mapping->shmem_segment), allocsize);
+		}
 	}
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = *size;
+		allocsize = mapping->shmem_size;
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
 	if (ptr == MAP_FAILED)
 	{
 		errno = mmap_errno;
+		DebugMappings();
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
-	*size = allocsize;
-	return ptr;
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		AnonymousMapping m = Mappings[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	mapping->shmem_size = size;
+	mapping->shmem_segment = next_free_segment;
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping);
+
+		next_free_segment++;
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + next_free_segment;
 
 	for (;;)
 	{
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	mapping->seg_addr = memAddress;
+	mapping->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (mapping->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		AnonymousMapping m = Mappings[i];
+
+		if (m.seg_addr != NULL)
+		{
+			if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+			m.seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index 5854ad1f54d..e7365ff8060 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 567739b5be9..5b55bec8d9d 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -61,6 +61,8 @@ static void proc_exit_prepare(int code);
  * but provide some additional features we need --- in particular,
  * we want to register callbacks to invoke when we are disconnecting
  * from a broken shared-memory context but not exiting the postmaster.
+ * Maximum number of such exit callbacks depends on the number of shared
+ * segments.
  *
  * Callback functions can take zero, one, or two args: the first passed
  * arg is the integer exitcode, the second is the Datum supplied when
@@ -68,7 +70,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..8b38e985327 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -86,7 +86,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_segment)
 {
 	Size		size;
 	int			numSemas;
@@ -206,33 +206,38 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize(&numSemas);
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
-
-	/*
-	 * Create semaphores
-	 */
-	PGReserveSemaphores(numSemas);
-
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		/* Compute the size of the shared-memory block */
+		size = CalculateShmemSize(&numSemas, segment);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+		/*
+		 * Create the shmem segment.
+		 *
+		 * XXX: Are multiple shims needed, one per segment?
+		 */
+		seghdr = PGSharedMemoryCreate(size, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSegment(seghdr, segment);
+
+		/*
+		 * Create semaphores
+		 */
+		PGReserveSemaphores(numSemas, segment);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSegment(segment);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -363,7 +368,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas);
+	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c9ae3b45b76..72255a1c5ca 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -76,19 +76,19 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
+								 int shmem_segment);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
+ShmemSegment Segments[ANON_MAPPINGS];
 
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/*
+ * Primary index hashtable for shmem. For simplicity we use a single one for
+ * all shared memory segments; there can be performance consequences of that,
+ * and an alternative would be to have one index per shared memory segment.
+ */
+static HTAB *ShmemIndex = NULL;
 
 /* To get reliable results for NUMA inquiry we need to "touch pages" once */
 static bool firstNumaTouch = true;
@@ -101,9 +101,17 @@ Datum		pg_numa_available(PG_FUNCTION_ARGS);
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-	ShmemSegHdr = seghdr;
-	ShmemBase = seghdr;
-	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+	InitShmemAccessInSegment(seghdr, MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAccessInSegment(PGShmemHeader *seghdr, int shmem_segment)
+{
+	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	ShmemSegment *seg = &Segments[shmem_segment];
+	seg->ShmemSegHdr = shmhdr;
+	seg->ShmemBase = (void *) shmhdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -114,7 +122,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSegment(MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAllocationInSegment(int shmem_segment)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_segment].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -123,9 +137,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_segment].ShmemLock = (slock_t *) ShmemAllocUnlockedInSegment(sizeof(slock_t), shmem_segment);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_segment].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -150,11 +164,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocInSegment(Size size, int shmem_segment)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -184,6 +204,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSegment(size, allocated_size, MAIN_SHMEM_SEGMENT);
+}
+
+static void *
+ShmemAllocRawInSegment(Size size, Size *allocated_size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -203,22 +229,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= Segments[shmem_segment].ShmemSegHdr->totalsize)
 	{
-		newSpace = (char *) ShmemBase + newStart;
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
+		Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -236,6 +262,12 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+	return ShmemAllocUnlockedInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -246,19 +278,19 @@ ShmemAllocUnlocked(Size size)
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory (%zu bytes requested)",
 						size)));
-	ShmemSegHdr->freeoffset = newFree;
+	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (char *) ShmemBase + newStart;
+	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -273,7 +305,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSegment(addr, MAIN_SHMEM_SEGMENT);
+}
+
+bool
+ShmemAddrIsValidInSegment(const void *addr, int shmem_segment)
+{
+	return (addr >= Segments[shmem_segment].ShmemBase) && (addr < Segments[shmem_segment].ShmemEnd);
 }
 
 /*
@@ -334,6 +372,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSegment(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SEGMENT);
+}
+
+HTAB *
+ShmemInitHashInSegment(const char *name,		/* table string name for shmem index */
+			  long init_size,		/* initial table size */
+			  long max_size,		/* max size of the table */
+			  HASHCTL *infoP,		/* info about key and bucket size */
+			  int hash_flags,		/* info about infoP */
+			  int shmem_segment) 	/* in which segment to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -350,9 +400,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSegment(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_segment);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -385,6 +435,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSegment(name, size, foundPtr, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
+					  int shmem_segment)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -393,7 +450,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_segment].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -416,7 +473,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSegment(size, shmem_segment);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -433,8 +490,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in segment %d",
+						name, shmem_segment)));
 	}
 
 	if (*foundPtr)
@@ -459,7 +516,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -478,14 +535,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -542,10 +598,11 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
+	/* XXX: take all shared memory segments into account. */
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
 		values[2] = Int64GetDatum(ent->size);
 		values[3] = Int64GetDatum(ent->allocated_size);
 		named_allocated += ent->allocated_size;
@@ -557,15 +614,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	/* output shared memory allocated but not counted via the shmem index */
 	values[0] = CStringGetTextDatum("<anonymous>");
 	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 	/* output as-of-yet unused shared memory */
 	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
@@ -630,7 +687,12 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 	 * this is not very likely, and moreover we have more entries, each of
 	 * them using only fraction of the total pages.
 	 */
-	shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		PGShmemHeader *shmhdr = Segments[segment].ShmemSegHdr;
+		shm_total_page_count += (shmhdr->totalsize / os_page_size) + 1;
+	}
+
 	page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
 	pages_status = palloc(sizeof(int) * shm_total_page_count);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46f44bc4511..a36b08895c8 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -80,6 +80,8 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
+#include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/procnumber.h"
@@ -618,10 +620,15 @@ LWLockNewTrancheId(void)
 	int		   *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
-	/* We use the ShmemLock spinlock to protect LWLockCounter */
-	SpinLockAcquire(ShmemLock);
+	/*
+	 * We use the ShmemLock spinlock to protect LWLockCounter.
+	 *
+	 * XXX: Looks like this is the only use of Segments outside of shmem.c,
+	 * it's maybe worth it to reshape this part to hide Segments structure.
+	 */
+	SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 	result = (*LWLockCounter)++;
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	return result;
 }
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 3baf418b3d1..6ebda479ced 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index fa6ca35a51f..8ae9637fcd0 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_segment);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 5f7d4b83a60..2348c59b5a0 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+} ShmemSegment;
+
+/* Number of available segments for anonymous memory mappings */
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -91,4 +106,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main segment; it contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SEGMENT 0
+
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index c1f668ded95..69663d412c3 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -29,15 +29,27 @@
 extern PGDLLIMPORT slock_t *ShmemLock;
 struct PGShmemHeader;			/* avoid including storage/pg_shmem.h here */
 extern void InitShmemAccess(struct PGShmemHeader *seghdr);
+extern void InitShmemAccessInSegment(struct PGShmemHeader *seghdr,
+									 int shmem_segment);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSegment(int shmem_segment);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSegment(Size size, int shmem_segment);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSegment(Size size, int shmem_segment);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSegment(const void *addr, int shmem_segment);
 extern void InitShmemIndex(void);
+extern void InitVariableShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
+									long max_size, HASHCTL *infoP,
+									int hash_flags, int shmem_segment);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSegment(const char *name, Size size,
+									  bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 
-- 
2.49.0

v5-0005-Address-space-reservation-for-shared-memory.patch
From 6238657ddb8c9e63d28a1a96712278f548d3292c Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:47:04 +0200
Subject: [PATCH v5 05/10] Address space reservation for shared memory

Currently the shared memory layout is designed to pack everything tightly
together, leaving no space between mappings for resizing. Here is how it
looks for one mapping in /proc/$PID/maps; /dev/zero represents the
anonymous shared memory in question:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
    ...

Make the layout more dynamic by splitting every shared memory segment
into two parts:

* An anonymous file, which actually contains the shared memory content.
  Such an anonymous file is created via memfd_create; it lives in memory,
  behaves like a regular file and is semantically equivalent to anonymous
  memory allocated via mmap with MAP_ANONYMOUS.

* A reservation mapping, whose size is much larger than the required
  shared segment size. This mapping is created with PROT_NONE (which
  makes sure the reserved space is not used) and MAP_NORESERVE (so the
  reserved space is not counted against memory limits). The anonymous
  file is mapped into this reservation mapping.

The resulting layout looks like this:

    00400000-00490000         /path/bin/postgres
    ...
    3f526000-3f590000 rw-p 		[heap]
    7fbd827fe000-7fbd8bdde000 rw-s 	/memfd:main (deleted) -- anon file
    7fbd8bdde000-7fbe82800000 ---s 	/memfd:main (deleted) -- reservation
    7fbe82800000-7fbe90670000 r--p 	/usr/lib/locale/locale-archive
    7fbe90800000-7fbe90941000 r-xp 	/usr/lib64/libstdc++.so.6.0.34

To resize a shared memory segment in this layout it's possible to use ftruncate
on the anonymous file, adjusting access permissions on the reserved space as
needed.

This approach also does not impact the actual memory usage as reported by
the kernel. Here is the output of /proc/$PID/status for the master
version with shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages
    // mapped in mm_struct. It corresponds to the mapped reserved space
    // and is the only number that grows with it.
    VmPeak:          2043192 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             22908 kB
    // Size of resident anonymous memory
    RssAnon:             768 kB
    // Size of resident file mappings
    RssFile:           10364 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          11776 kB

Here is the same for the patch when reserving 20GB of space:

    VmPeak:         21255824 kB
    VmRSS:             25020 kB
    RssAnon:             768 kB
    RssFile:           10812 kB
    RssShmem:          13440 kB

Cgroup v2 doesn't have any problems with that either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    # from that shell
    $ cat memory.current
    17465344 (~16.6 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    20770816 (~19.8 MB)

To control the amount of space reserved, a new GUC max_available_memory
is introduced. Ideally it should be based on the maximum available
memory, hence the name.

There are also a few unrelated advantages of using anon files:

* We've got a file descriptor, which could be used for regular file
  operations (modification, truncation, you name it).

* The file could be given a name, which improves readability when it
  comes to process maps.

* By default, Linux will not add file-backed shared mappings into a core dump,
  making it more convenient to work with them in PostgreSQL: no more huge dumps
  to process.

The downside is that memfd_create is Linux specific.
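
For reference, the sequence of calls behind this layout is roughly the
following sketch (error handling and huge pages omitted; the name "main"
and both size variables are just placeholders):

    /* In-memory anonymous file backing the segment content */
    int   fd = memfd_create("main", 0);
    ftruncate(fd, segment_size);

    /* Large reservation: no access, not counted against memory limits */
    char *base = mmap(NULL, reserved_size, PROT_NONE,
                      MAP_SHARED | MAP_NORESERVE, fd, 0);

    /* Map the usable part of the file at the start of the reservation */
    mmap(base, segment_size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);

    /* Growing later: extend the file, then extend the accessible part */
    ftruncate(fd, new_size);
    mmap(base, new_size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);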
---
 src/backend/port/sysv_shmem.c       | 290 ++++++++++++++++++++++------
 src/backend/port/win32_shmem.c      |   2 +-
 src/backend/storage/ipc/ipci.c      |   5 +-
 src/backend/storage/ipc/shmem.c     |   2 +-
 src/backend/utils/init/globals.c    |   1 +
 src/backend/utils/misc/guc_tables.c |  14 ++
 src/include/miscadmin.h             |   1 +
 src/include/portability/mem.h       |   2 +-
 src/include/storage/pg_shmem.h      |   5 +-
 9 files changed, 262 insertions(+), 60 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 56af0231d24..363ddfd1fca 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -97,10 +97,12 @@ void	   *UsedShmemSegAddr = NULL;
 typedef struct AnonymousMapping
 {
 	int shmem_segment;
-	Size shmem_size; 			/* Size of the mapping */
+	Size shmem_size; 			/* Size of the actually used memory */
+	Size shmem_reserved; 		/* Size of the reserved mapping */
 	Pointer shmem; 				/* Pointer to the start of the mapped memory */
 	Pointer seg_addr; 			/* SysV shared memory for the header */
 	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -108,6 +110,49 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
+/*
+ * Anonymous mapping layout we use looks like this:
+ *
+ * 00400000-00c2a000 r-xp 			/bin/postgres
+ * ...
+ * 3f526000-3f590000 rw-p 			[heap]
+ * 7fbd827fe000-7fbd8bdde000 rw-s 	/memfd:main (deleted)
+ * 7fbd8bdde000-7fbe82800000 ---s 	/memfd:main (deleted)
+ * 7fbe82800000-7fbe90670000 r--p 	/usr/lib/locale/locale-archive
+ * 7fbe90800000-7fbe90941000 r-xp 	/usr/lib64/libstdc++.so.6.0.34
+ * ...
+ *
+ * We need to place shared memory mappings in such a way, that there will be
+ * gaps between them in the address space. Those gaps have to be large enough
+ * to resize the mapping up to certain size, without counting towards the total
+ * memory consumption.
+ *
+ * To achieve this, for each shared memory segment we first create an anonymous
+ * file of the specified size using memfd_create, which will accommodate the
+ * actual shared memory mapping content. It is represented by the first
+ * /memfd:main with rw permissions. Then we create a mapping for this file
+ * using mmap, with a size much larger than required, using PROT_NONE (to make
+ * sure the reserved space will not be used) and MAP_NORESERVE (to prevent the
+ * space from being counted against memory limits). The mapping serves as an
+ * address space reservation into which the shared memory segment can be
+ * extended, and is represented by the second /memfd:main with no permissions.
+ *
+ * The reserved space for each segment is calculated as a fraction of the total
+ * reserved space (MaxAvailableMemory), as specified in the SHMEM_RESIZE_RATIO
+ * array.
+ */
+static double SHMEM_RESIZE_RATIO[1] = {
+	1.0, 									/* MAIN_SHMEM_SEGMENT */
+};
+
+/*
+ * Flag indicating that we have decided to use huge pages.
+ *
+ * XXX: It's possible to use GetConfigOption("huge_pages_status", false, false)
+ * instead, but it feels like overkill.
+ */
+static bool huge_pages_on = false;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -503,19 +548,20 @@ PGSharedMemoryAttach(IpcMemoryId shmId,
  * hugepage sizes, we might want to think about more invasive strategies,
  * such as increasing shared_buffers to absorb the extra space.
  *
- * Returns the (real, assumed or config provided) page size into
- * *hugepagesize, and the hugepage-related mmap flags to use into
- * *mmap_flags if requested by the caller.  If huge pages are not supported,
- * *hugepagesize and *mmap_flags are set to 0.
+ * Returns the (real, assumed or config provided) page size into *hugepagesize,
+ * the hugepage-related mmap and memfd flags to use into *mmap_flags and
+ * *memfd_flags if requested by the caller. If huge pages are not supported,
+ * *hugepagesize, *mmap_flags and *memfd_flags are set to 0.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 #ifdef MAP_HUGETLB
 
 	Size		default_hugepagesize = 0;
 	Size		hugepagesize_local = 0;
 	int			mmap_flags_local = 0;
+	int			memfd_flags_local = 0;
 
 	/*
 	 * System-dependent code to find out the default huge page size.
@@ -574,6 +620,7 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	}
 
 	mmap_flags_local = MAP_HUGETLB;
+	memfd_flags_local = MFD_HUGETLB;
 
 	/*
 	 * On recent enough Linux, also include the explicit page size, if
@@ -584,7 +631,16 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	{
 		int			shift = pg_ceil_log2_64(hugepagesize_local);
 
-		mmap_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+	}
+#endif
+
+#if defined(MFD_HUGE_MASK) && defined(MFD_HUGE_SHIFT)
+	if (hugepagesize_local != default_hugepagesize)
+	{
+		int			shift = pg_ceil_log2_64(hugepagesize_local);
+
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
 	}
 #endif
 
@@ -593,6 +649,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*mmap_flags = mmap_flags_local;
 	if (hugepagesize)
 		*hugepagesize = hugepagesize_local;
+	if (memfd_flags)
+		*memfd_flags = memfd_flags_local;
 
 #else
 
@@ -600,6 +658,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*hugepagesize = 0;
 	if (mmap_flags)
 		*mmap_flags = 0;
+	if (memfd_flags)
+		*memfd_flags = 0;
 
 #endif							/* MAP_HUGETLB */
 }
@@ -625,72 +685,90 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
  * Creates an anonymous mmap()ed shared memory segment.
  *
  * This function will modify mapping size to the actual size of the allocation,
- * if it ends up allocating a segment that is larger than requested.
+ * if it ends up allocating a segment that is larger than requested. If needed,
+ * it also rounds up the reserved size of the mapping to be a multiple of the
+ * huge page size.
+ *
+ * Note that we do not fall back from huge pages to regular pages in this
+ * function; that decision was already made in PrepareHugePages and we
+ * stick to it.
  */
 static void
 CreateAnonymousSegment(AnonymousMapping *mapping)
 {
 	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
-	int			mmap_errno = 0;
+	int			save_errno = 0;
+	int			mmap_flags = PG_MMAP_FLAGS, memfd_flags = 0;
+
+	elog(DEBUG1, "segment[%s]: size %zu, reserved %zu",
+		 MappingName(mapping->shmem_segment), mapping->shmem_size,
+		 mapping->shmem_reserved);
 
 #ifndef MAP_HUGETLB
-	/* PGSharedMemoryCreate should have dealt with this case */
-	Assert(huge_pages != HUGE_PAGES_ON);
+	/* PrepareHugePages should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
 #else
-	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	if (huge_pages_on)
 	{
-		/*
-		 * Round up the request size to a suitable large value.
-		 */
 		Size		hugepagesize;
-		int			mmap_flags;
 
-		GetHugePageSize(&hugepagesize, &mmap_flags);
+		/* Make sure nothing is messed up */
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+		/* Round up the request size to a suitable large value */
+		GetHugePageSize(&hugepagesize, &mmap_flags, &memfd_flags);
 
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
-		mmap_errno = errno;
-		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-		{
-			DebugMappings();
-			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 MappingName(mapping->shmem_segment), allocsize);
-		}
+		/*
+		 * The reserved space is a multiple of BLCKSZ. Now that we know the
+		 * huge page size, round the reserved space up to it.
+		 */
+		mapping->shmem_reserved = mapping->shmem_reserved + hugepagesize -
+			(mapping->shmem_reserved % hugepagesize);
+
+		/* Verify that the new size is within the reserved boundaries */
+		if (mapping->shmem_reserved < mapping->shmem_size)
+			ereport(ERROR,
+					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					 errmsg("not enough shared memory is reserved"),
+					 errhint("You may need to increase \"max_available_memory\".")));
+
+		mmap_flags = PG_MMAP_FLAGS | mmap_flags;
 	}
 #endif
 
 	/*
-	 * Report whether huge pages are in use.  This needs to be tracked before
-	 * the second mmap() call if attempting to use huge pages failed
-	 * previously.
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped, it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, thus it has
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
 	 */
-	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
-					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_segment),
+									   memfd_flags);
 
-	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
+	/*
+	 * Specify the segment file size using allocsize, which contains the
+	 * potentially rounded-up value.
+	 */
+	if(ftruncate(mapping->segment_fd, allocsize) == -1)
 	{
-		/*
-		 * Use the original size, not the rounded-up value, when falling back
-		 * to non-huge pages.
-		 */
-		allocsize = mapping->shmem_size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
-		mmap_errno = errno;
-	}
+		save_errno = errno;
 
-	if (ptr == MAP_FAILED)
-	{
-		errno = mmap_errno;
 		DebugMappings();
+		close(mapping->segment_fd);
+
+		errno = save_errno;
 		ereport(FATAL,
-				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+				(errmsg("segment[%s]: could not truncate anonymous file: %m",
 						MappingName(mapping->shmem_segment)),
-				 (mmap_errno == ENOMEM) ?
+				 (save_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
 						 "swap space, or huge pages. To reduce the request size "
@@ -700,10 +778,112 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 						 allocsize) : 0));
 	}
 
+	elog(DEBUG1, "segment[%s]: mmap(%zu)",
+		 MappingName(mapping->shmem_segment), allocsize);
+
+	/*
+	 * Create a reservation mapping.
+	 */
+	ptr = mmap(NULL, mapping->shmem_reserved, PROT_NONE,
+			   mmap_flags | MAP_NORESERVE, mapping->segment_fd, 0);
+	save_errno = errno;
+
+	if (ptr == MAP_FAILED)
+	{
+		DebugMappings();
+
+		errno = save_errno;
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment))));
+	}
+
+	/* Make the memory accessible */
+	if(mprotect(ptr, allocsize, PROT_READ | PROT_WRITE) == -1)
+	{
+		save_errno = errno;
+		DebugMappings();
+
+		errno = save_errno;
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not mprotect anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment))));
+	}
+
 	mapping->shmem = ptr;
 	mapping->shmem_size = allocsize;
 }
 
+/*
+ * PrepareHugePages
+ *
+ * Figure out if there are enough huge pages to allocate all shared memory
+ * segments, and report that information via huge_pages_status and
+ * huge_pages_on. It needs to be called before creating shared memory segments.
+ *
+ * It is necessary to maintain the same semantics (simple on/off) for
+ * huge_pages_status, even if there are multiple shared memory segments: all
+ * segments either use huge pages or none of them do; there is no mix of
+ * segments with different page sizes. The latter might actually be beneficial,
+ * in particular because only some segments may require a large amount of
+ * memory, but for now we go with the simple solution.
+ */
+void
+PrepareHugePages()
+{
+	void	   *ptr = MAP_FAILED;
+
+	/* Reset to handle reinitialization */
+	next_free_segment = 0;
+
+	/* Complain if hugepages demanded but we can't possibly support them */
+#if !defined(MAP_HUGETLB)
+	if (huge_pages == HUGE_PAGES_ON)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("huge pages not supported on this platform")));
+#else
+	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	{
+		Size		hugepagesize, total_size = 0;
+		int			mmap_flags;
+
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+
+		/*
+		 * Figure out how much memory is needed for all segments, keeping in
+		 * mind that for every segment this value is rounded up to the
+		 * huge page size. The resulting value will be used to probe memory and
+		 * decide whether we will allocate huge pages or not.
+		 */
+		for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+		{
+			int	numSemas;
+			Size segment_size = CalculateShmemSize(&numSemas, segment);
+
+			if (segment_size % hugepagesize != 0)
+				segment_size += hugepagesize - (segment_size % hugepagesize);
+
+			total_size += segment_size;
+		}
+
+		/* Map total amount of memory to test its availability. */
+		elog(DEBUG1, "reserving space: probe mmap(%zu) with MAP_HUGETLB",
+					 total_size);
+		ptr = mmap(NULL, total_size, PROT_NONE,
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
+	}
+#endif
+
+	/*
+	 * Report whether huge pages are in use. This needs to be tracked before
+	 * creating shared memory segments.
+	 */
+	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
+					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	huge_pages_on = ptr != MAP_FAILED;
+}
+
 /*
  * AnonymousShmemDetach --- detach from an anonymous mmap'd block
  * (called as an on_shmem_exit callback, hence funny argument list)
@@ -746,7 +926,7 @@ PGSharedMemoryCreate(Size size,
 	void	   *memAddress;
 	PGShmemHeader *hdr;
 	struct stat statbuf;
-	Size		sysvsize;
+	Size		sysvsize, total_reserved;
 	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
@@ -760,14 +940,6 @@ PGSharedMemoryCreate(Size size,
 				 errmsg("could not stat data directory \"%s\": %m",
 						DataDir)));
 
-	/* Complain if hugepages demanded but we can't possibly support them */
-#if !defined(MAP_HUGETLB)
-	if (huge_pages == HUGE_PAGES_ON)
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("huge pages not supported on this platform")));
-#endif
-
 	/* For now, we don't support huge pages in SysV memory */
 	if (huge_pages == HUGE_PAGES_ON && shared_memory_type != SHMEM_TYPE_MMAP)
 		ereport(ERROR,
@@ -776,8 +948,16 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+
+	/* Prepare the mapping information */
 	mapping->shmem_size = size;
 	mapping->shmem_segment = next_free_segment;
+	total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
+	mapping->shmem_reserved = total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
+
+	/* Round up to be a multiple of BLCKSZ */
+	mapping->shmem_reserved = mapping->shmem_reserved + BLCKSZ -
+		(mapping->shmem_reserved % BLCKSZ);
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 4dee856d6bd..732fedee87e 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -627,7 +627,7 @@ pgwin32_ReserveSharedMemoryRegion(HANDLE hChild)
  * use GetLargePageMinimum() instead.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 	if (hugepagesize)
 		*hugepagesize = 0;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8b38e985327..b60f7ef9ce2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -206,6 +206,9 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
+	/* Decide if we use huge pages or regular size pages */
+	PrepareHugePages();
+
 	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
 	{
 		/* Compute the size of the shared-memory block */
@@ -377,7 +380,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the number of huge pages required.
 	 */
-	GetHugePageSize(&hp_size, NULL);
+	GetHugePageSize(&hp_size, NULL, NULL);
 	if (hp_size != 0)
 	{
 		Size		hp_required;
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 72255a1c5ca..8d025f0e907 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -817,7 +817,7 @@ pg_get_shmem_pagesize(void)
 	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
 
 	if (huge_pages_status == HUGE_PAGES_ON)
-		GetHugePageSize(&os_page_size, NULL);
+		GetHugePageSize(&os_page_size, NULL, NULL);
 
 	return os_page_size;
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..90d3feb547c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -140,6 +140,7 @@ int			max_parallel_maintenance_workers = 2;
  * register background workers.
  */
 int			NBuffers = 16384;
+int			MaxAvailableMemory = 524288;
 int			MaxConnections = 100;
 int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f04bfedb2fd..a221e446d6a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2376,6 +2376,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_available_memory", PGC_SIGHUP, RESOURCES_MEM,
+			gettext_noop("Sets the upper limit for the shared_buffers value."),
+			gettext_noop("Shared memory can be resized at runtime; this "
+						 "parameter sets the upper limit for it, beyond which "
+						 "resizing is not supported. Normally this value "
+						 "would be the same as the total available memory."),
+			GUC_UNIT_BLOCKS
+		},
+		&MaxAvailableMemory,
+		524288, 16, INT_MAX / 2,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_buffer_usage_limit", PGC_USERSET, RESOURCES_MEM,
 			gettext_noop("Sets the buffer pool size for VACUUM, ANALYZE, and autovacuum."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..a0c37a7749e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -173,6 +173,7 @@ extern PGDLLIMPORT char *DataDir;
 extern PGDLLIMPORT int data_directory_mode;
 
 extern PGDLLIMPORT int NBuffers;
+extern PGDLLIMPORT int MaxAvailableMemory;
 extern PGDLLIMPORT int MaxBackends;
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index ef9800732d9..40588ff6968 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 2348c59b5a0..79b0b1ef9eb 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
 extern PGDLLIMPORT int huge_pages_status;
+extern PGDLLIMPORT int MaxAvailableMemory;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
@@ -104,7 +105,9 @@ extern PGShmemHeader *PGSharedMemoryCreate(Size size,
 										   PGShmemHeader **shim);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
-extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
+extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
+							int *memfd_flags);
+void PrepareHugePages(void);
 
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
-- 
2.49.0

v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patchtext/plain; charset=us-asciiDownload
From f23d42ef1ccdb28b751a8f12c7737002def4e674 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:22:02 +0200
Subject: [PATCH v5 06/10] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into the following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers,
meaning that if we want to change NBuffers, they have to be resized
correspondingly. Placing each of them in a separate shmem segment makes
that possible.

There are some assumptions made about the upper size limit of each
shmem segment. The buffer blocks have the largest one, while the rest
claim less extra room for resizing. Ideally those limits should be
derived from the maximum allowed shared memory.
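
For a rough sense of scale, assuming the ratios used in this patch (60%
for buffer blocks, 10% each for the main, descriptors and iocv segments,
5% each for checkpoint and strategy) and the default max_available_memory
of 524288 blocks (4 GB with the default 8 kB BLCKSZ), the reserved
address space per segment works out to roughly:

    buffers:     0.60 * 4 GB ~= 2458 MB
    main:        0.10 * 4 GB ~=  410 MB
    descriptors: 0.10 * 4 GB ~=  410 MB
    iocv:        0.10 * 4 GB ~=  410 MB
    checkpoint:  0.05 * 4 GB ~=  205 MB
    strategy:    0.05 * 4 GB ~=  205 MB

Only the currently used part of each reservation is backed by memory; the
rest is PROT_NONE/MAP_NORESERVE address space.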
---
 src/backend/port/sysv_shmem.c          | 24 +++++++-
 src/backend/storage/buffer/buf_init.c  | 79 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  6 +-
 src/backend/storage/buffer/freelist.c  |  5 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 24 +++++++-
 7 files changed, 105 insertions(+), 37 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 363ddfd1fca..dac011b766b 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -139,10 +139,18 @@ static int next_free_segment = 0;
  *
  * The reserved space for each segment is calculated as a fraction of the total
  * reserved space (MaxAvailableMemory), as specified in the SHMEM_RESIZE_RATIO
- * array.
+ * array. E.g. we allow BUFFERS_SHMEM_SEGMENT to take up to 60% of the whole
+ * space when resizing, based on the fact that it will most likely be the main
+ * consumer of this memory. Those numbers are pulled out of thin air for now;
+ * it makes sense to evaluate them more precisely.
  */
-static double SHMEM_RESIZE_RATIO[1] = {
-	1.0, 									/* MAIN_SHMEM_SLOT */
+static double SHMEM_RESIZE_RATIO[6] = {
+	0.1,    /* MAIN_SHMEM_SEGMENT */
+	0.6,    /* BUFFERS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_IOCV_SHMEM_SEGMENT */
+	0.05,   /* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
+	0.05,   /* STRATEGY_SHMEM_SEGMENT */
 };
 
 /*
@@ -167,6 +175,16 @@ MappingName(int shmem_segment)
 	{
 		case MAIN_SHMEM_SEGMENT:
 			return "main";
+		case BUFFERS_SHMEM_SEGMENT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SEGMENT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SEGMENT:
+			return "strategy";
 		default:
 			return "unknown";
 	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..bd68b69ee98 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -62,7 +62,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). Size of data structures initialized
+ * here depends on NBuffers, and to be able to change NBuffers without a
+ * restart we store each structure into a separate shared memory segment, which
+ * could be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -74,22 +77,22 @@ BufferManagerShmemInit(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSegment("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSegment("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -99,8 +102,9 @@ BufferManagerShmemInit(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -156,33 +160,54 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc. based on the
+ * shared memory segment. The main segment must not allocate anything
+ * related to buffers, every other segment will receive part of the
+ * data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(int shmem_segment)
 {
 	Size		size = 0;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == MAIN_SHMEM_SEGMENT)
+		return size;
 
-	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
+	{
+		/* size of buffer descriptors */
+		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+		/* to allow aligning buffer descriptors */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of data pages, plus alignment padding */
+		size = add_size(size, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	}
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
-								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
+	{
+		/* size of stuff controlled by freelist.c */
+		size = add_size(size, StrategyShmemSize());
+	}
 
-	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
+	{
+		/* size of I/O condition variables */
+		size = add_size(size, mul_size(NBuffers,
+									   sizeof(ConditionVariableMinimallyPadded)));
+		/* to allow aligning the above */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
+
+	if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of checkpoint sort array in bufmgr.c */
+		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	}
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index a50955d5286..a9952b36eba 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -22,6 +22,7 @@
 #include "postgres.h"
 
 #include "storage/buf_internals.h"
+#include "storage/pg_shmem.h"
 
 /* entry for buffer lookup hashtable */
 typedef struct
@@ -59,10 +60,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION,
+								  STRATEGY_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..bd390f2709d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 
 #define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
@@ -491,9 +492,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SEGMENT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b60f7ef9ce2..2dbd81afc87 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -113,7 +113,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize());
+	size = add_size(size, BufferManagerShmemSize(shmem_segment));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41fdc1e7693..edac9db6a12 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,7 +318,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 79b0b1ef9eb..a7b275b4db9 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -109,7 +109,29 @@ extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 							int *memfd_flags);
 void PrepareHugePages(void);
 
+/*
+ * To be able to dynamically resize the largest parts of the data stored in
+ * shared memory, we split it into multiple shared memory segments. Each
+ * segment contains only a certain part of the data, whose size depends on
+ * NBuffers.
+ */
+
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SEGMENT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SEGMENT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SEGMENT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SEGMENT 5
+
 #endif							/* PG_SHMEM_H */
-- 
2.49.0

v5-0007-Allow-to-resize-shared-memory-without-restart.patchtext/plain; charset=us-asciiDownload
From c5717c5b68439abf4488fc3e1c2e47a94c6cd9f3 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 14:16:55 +0200
Subject: [PATCH v5 07/10] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using the
space introduced in the previous commits, without requiring a PostgreSQL
restart. Essentially the implementation is based on two mechanisms: a
ProcSignalBarrier is used to make sure all processes start the resize
procedure simultaneously, and a global Barrier is used to coordinate
after that, making sure all finished processes wait for the ones still
in progress.

The resize process looks like this:

* The GUC assign hook sets a flag to let the Postmaster know that a
  resize was requested.

* Postmaster checks the flag in its event loop and starts the resize
  by emitting a ProcSignal barrier.

* All processes that participate in the ProcSignal mechanism begin to
  process the ProcSignal barrier. First, each process waits until all
  processes have confirmed they received the message and can start
  simultaneously.

* Every process recalculates the shared memory size based on the new
  NBuffers, adjusts the segment sizes using ftruncate and adjusts the
  reservation permissions with mprotect. One elected process signals the
  postmaster to do the same.

* When finished, every process waits on a global ShmemControl barrier
  until all others are finished as well. This way we ensure three
  stages with clear boundaries: before the resize, when all processes
  use the old NBuffers; during the resize, when processes have a mix of
  old and new NBuffers and wait until it's done; after the resize, when
  all processes use the new NBuffers.

* After all processes are using the new value, one of them initializes
  the new shared structures (buffer blocks, descriptors, etc.) as needed
  and broadcasts the new value of NBuffers via ShmemControl in shared
  memory. Other backends wait for this operation to finish as well. Then
  the barrier is lifted and everything goes on as usual (a condensed
  sketch of this handshake follows the list).
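
Condensed, the per-backend part of this handshake looks roughly as
follows (a sketch using the names from the patch below; error paths and
details are elided):

    bool
    ProcessBarrierShmemResize(Barrier *barrier)
    {
        /* resize not requested yet from this backend's point of view */
        if (!pending_pm_shmem_resize)
            return false;

        /* gather: wait until all ProcSignal participants got the message */
        if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
            WaitForProcSignalBarrierReceived(
                pg_atomic_read_u64(&ShmemCtrl->Generation));

        /* start together; one backend also pings the postmaster */
        if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
            SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);

        /* ftruncate + mprotect each segment, reinit new buffers once */
        AnonymousShmemResize();

        /* wait until everyone is done before using the buffer pool again */
        BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
        BarrierDetach(barrier);
        return true;
    }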

Since resizing takes time, we need to take into account that during that time:

- New backends can be spawned. They will check the status of the barrier
  early during bootstrap, and wait until everything is over before
  working with the new NBuffers value.

- Old backends can exit before attempting to resize. The synchronization
  used between backends relies on ProcSignalBarrier and waits at the
  beginning for all participants to receive the message, so that all
  existing backends are gathered.

- Some backends might be blocked and not responding, either before or
  after receiving the message. In the first case such a backend still
  has a ProcSignalSlot and should be waited for; in the second case the
  shared barrier will make sure we still wait for those backends. In
  either case the wait is unbounded.

- Backends might join the barrier in disjoint groups with some time in
  between. That means that relying only on the shared dynamic barrier is
  not enough -- it will only synchronize the resize procedure within
  those groups. That's why we first wait for all participants of the
  ProcSignal mechanism who received the message.

Here is how it looks after raising shared_buffers from 128 MB to
512 MB and calling pg_reload_conf():

    -- 128 MB
    7f87909fc000-7f8798248000 rw-s /memfd:strategy (deleted)
    7f8798248000-7f879d6ca000 ---s /memfd:strategy (deleted)
    7f879d6ca000-7f87a4e84000 rw-s /memfd:checkpoint (deleted)
    7f87a4e84000-7f87aa398000 ---s /memfd:checkpoint (deleted)
    7f87aa398000-7f87b1b42000 rw-s /memfd:iocv (deleted)
    7f87b1b42000-7f87c3d32000 ---s /memfd:iocv (deleted)
    7f87c3d32000-7f87cb59c000 rw-s /memfd:descriptors (deleted)
    7f87cb59c000-7f87dd6cc000 ---s /memfd:descriptors (deleted)
    7f87dd6cc000-7f87ece38000 rw-s /memfd:buffers (deleted)
    ^ buffers content, ~247 MB
    7f87ece38000-7f8877066000 ---s /memfd:buffers (deleted)
    ^ reserved space, ~2210 MB
    7f8877066000-7f887e7d0000 rw-s /memfd:main (deleted)
    7f887e7d0000-7f8890a00000 ---s /memfd:main (deleted)

    -- 512 MB
    7f87909fc000-7f879866a000 rw-s /memfd:strategy (deleted)
    7f879866a000-7f879d6ca000 ---s /memfd:strategy (deleted)
    7f879d6ca000-7f87a50f4000 rw-s /memfd:checkpoint (deleted)
    7f87a50f4000-7f87aa398000 ---s /memfd:checkpoint (deleted)
    7f87aa398000-7f87b1d82000 rw-s /memfd:iocv (deleted)
    7f87b1d82000-7f87c3d32000 ---s /memfd:iocv (deleted)
    7f87c3d32000-7f87cba1c000 rw-s /memfd:descriptors (deleted)
    7f87cba1c000-7f87dd6cc000 ---s /memfd:descriptors (deleted)
    7f87dd6cc000-7f8804fb8000 rw-s /memfd:buffers (deleted)
    ^ buffers content, ~632 MB
    7f8804fb8000-7f8877066000 ---s /memfd:buffers (deleted)
    ^ reserved space, ~1824 MB
    7f8877066000-7f887e950000 rw-s /memfd:main (deleted)
    7f887e950000-7f8890a00000 ---s /memfd:main (deleted)

The implementation supports only increasing shared_buffers. Decreasing
the value would need a similar procedure, but the buffer blocks
containing data have to be drained first, so that the actual data set
fits into the new, smaller space.

Experiments show that shared mappings have to be extended separately in
each process that uses them. Another rough edge is that a backend
blocked on ReadCommand will not apply the shared_buffers change until it
receives something.

Authors: Dmitrii Dolgov, Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 446 ++++++++++++++++++
 src/backend/postmaster/checkpointer.c         |  12 +-
 src/backend/postmaster/postmaster.c           |  18 +
 src/backend/storage/buffer/buf_init.c         |  74 +--
 src/backend/storage/ipc/ipci.c                |  18 +-
 src/backend/storage/ipc/procsignal.c          |  46 ++
 src/backend/storage/ipc/shmem.c               |  23 +-
 src/backend/tcop/postgres.c                   |  10 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/backend/utils/misc/guc_tables.c           |   4 +-
 src/include/storage/bufmgr.h                  |   2 +-
 src/include/storage/ipc.h                     |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pg_shmem.h                |  26 +
 src/include/storage/pmsignal.h                |   1 +
 src/include/storage/procsignal.h              |   1 +
 src/tools/pgindent/typedefs.list              |   1 +
 17 files changed, 644 insertions(+), 45 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index dac011b766b..b3c90d15d52 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,13 +30,19 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/pmsignal.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/pidfile.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -107,6 +113,13 @@ typedef struct AnonymousMapping
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/* Flag telling postmaster that resize is needed */
+volatile bool pending_pm_shmem_resize = false;
+
+/* Keeps track of the previous NBuffers value */
+static int NBuffersOld = -1;
+static int NBuffersPending = -1;
+
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
@@ -161,6 +174,49 @@ static double SHMEM_RESIZE_RATIO[6] = {
  */
 static bool huge_pages_on = false;
 
+/*
+ * Flag telling that we have prepared the memory layout to be resizable. If it
+ * is false after all shared memory segments are created, it means we failed to
+ * set up the needed layout and fell back to the regular non-resizable approach.
+ */
+static bool shmem_resizable = false;
+
+/*
+ * Currently broadcasted value of NBuffers in shared memory.
+ *
+ * Most of the time this value is going to be equal to NBuffers. But if the
+ * postmaster is resizing shared memory and a new backend is created
+ * at the same time, there is a possibility for the new backend to inherit the
+ * old NBuffers value, but miss the resize signal if the ProcSignal
+ * infrastructure was not initialized yet. Consider this situation:
+ *
+ *     Postmaster ------> New Backend
+ *         |                   |
+ *         |                Launch
+ *         |                   |
+ *         |             Inherit NBuffers
+ *         |                   |
+ *     Resize NBuffers         |
+ *         |                   |
+ *     Emit Barrier            |
+ *         |            Init ProcSignal
+ *         |                   |
+ *     Finish resize           |
+ *         |                   |
+ *     New NBuffers       Old NBuffers
+ *
+ * In this case the backend is not yet ready to receive a signal from
+ * EmitProcSignalBarrier, and will be ignored. The same happens if ProcSignal
+ * is initialized even later, after the resizing has finished.
+ *
+ * To address the resulting inconsistency, the postmaster broadcasts the
+ * current NBuffers value via shared memory. Every new backend has to verify
+ * this value before it accesses the buffer pool: if it differs from its own
+ * value, this indicates a shared memory resize has happened and the backend
+ * has to synchronize with the rest of the pack first.
+ */
+ShmemControl *ShmemCtrl = NULL;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -924,6 +980,349 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * Resize all shared memory segments based on the current NBuffers value, which
+ * is applied from NBuffersPending. The actual segment resizing is done via
+ * ftruncate, which will fail if there is not sufficient space to expand the
+ * anon file. When finished, initialize any new buffer blocks based on the new
+ * and old values.
+ *
+ * If any segment was resized, as the last step this function reinitializes the
+ * new buffers as well and broadcasts the new value of NSharedBuffers. All of
+ * that needs to be done by only one backend, the first one that manages to
+ * grab the ShmemResizeLock.
+ */
+bool
+AnonymousShmemResize(void)
+{
+	int		numSemas;
+	bool 	reinit = false;
+	int		mmap_flags = PG_MMAP_FLAGS;
+	Size 	hugepagesize;
+
+	NBuffers = NBuffersPending;
+
+	elog(DEBUG1, "Resize shmem from %d to %d", NBuffersOld, NBuffers);
+
+	/*
+	 * XXX: Where to reset the flag is still an open question. E.g. do we
+	 * consider a no-op when NBuffers is equal to NBuffersOld a genuine resize
+	 * and reset the flag?
+	 */
+	pending_pm_shmem_resize = false;
+
+	/*
+	 * XXX: Currently only increasing of shared_buffers is supported. For
+	 * decreasing something similar has to be done, but buffer blocks with
+	 * data have to be drained first.
+	 */
+	if(NBuffersOld > NBuffers)
+		return false;
+
+#ifndef MAP_HUGETLB
+	/* PrepareHugePages should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
+#else
+	if (huge_pages_on)
+	{
+		/* Make sure nothing is messed up */
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+		/* Round up the new size to a suitable large value */
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+	}
+#endif
+
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		/* Note that CalculateShmemSize indirectly depends on NBuffers */
+		Size new_size = CalculateShmemSize(&numSemas, i);
+		AnonymousMapping *m = &Mappings[i];
+
+#ifdef MAP_HUGETLB
+		if (huge_pages_on && (new_size % hugepagesize != 0))
+			new_size += hugepagesize - (new_size % hugepagesize);
+#endif
+
+		if (m->shmem == NULL)
+			continue;
+
+		if (m->shmem_size == new_size)
+			continue;
+
+		if (m->shmem_reserved < new_size)
+			ereport(ERROR,
+					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					 errmsg("not enough shared memory is reserved"),
+					 errhint("You may need to increase \"max_available_memory\".")));
+
+		elog(DEBUG1, "segment[%s]: resize from %zu to %zu at address %p",
+					 MappingName(m->shmem_segment), m->shmem_size,
+					 new_size, m->shmem);
+
+		/* Resize the backing anon file. */
+		if(ftruncate(m->segment_fd, new_size) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not truncate anonymous file for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
+		/* Adjust memory accessibility */
+		if(mprotect(m->shmem, new_size, PROT_READ | PROT_WRITE) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not mprotect anonymous shared memory for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
+		/* If shrinking, make reserved space unavailable again */
+		if(new_size < m->shmem_size &&
+		   mprotect(m->shmem + new_size, m->shmem_size - new_size, PROT_NONE) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not mprotect reserved shared memory for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
+		reinit = true;
+		m->shmem_size = new_size;
+	}
+
+	if (reinit)
+	{
+		if(IsUnderPostmaster &&
+			LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+		{
+			/*
+			 * If the new NBuffers was already broadcasted, the buffer pool was
+			 * already initialized before.
+			 *
+			 * Since we're not on a hot path, we use lwlocks and do not need to
+			 * involve memory barrier.
+			 */
+			if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+			{
+				/*
+				 * Allow the first backend that managed to get the lock to
+				 * Allow the first backend that managed to get the lock to
+				 * reinitialize the new portion of the buffer pool. Every other
+				 * process will wait on the shared barrier for that to finish,
+				 * since it's a part of the SHMEM_RESIZE_DONE phase.
+				 *
+				 * Note that it's enough for only one backend to do that,
+				 * even the ShmemInitStruct part. The reason is that resized
+				 * shared memory will maintain the same addresses, meaning that
+				 * all the pointers are still valid, and we only need to update
+				 * the structure sizes in the ShmemIndex once -- any other backend
+				 *
+				 * XXX: This is the right place for buffer eviction as well.
+				 */
+				BufferManagerShmemInit(NBuffersOld);
+
+				/* If all fine, broadcast the new value */
+				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+			}
+
+			LWLockRelease(ShmemResizeLock);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * We are asked to resize shared memory. Wait for all ProcSignal participants
+ * to join the barrier, then do the resize and wait on the barrier until all
+ * participants finish resizing as well -- otherwise we risk
+ * inconsistency between backends.
+ *
+ * XXX: If a backend is blocked on ReadCommand in PostgresMain, it will not
+ * proceed with AnonymousShmemResize after receiving SIGHUP until something
+ * is sent.
+ */
+bool
+ProcessBarrierShmemResize(Barrier *barrier)
+{
+	Assert(IsUnderPostmaster);
+
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+
+	/* Wait until we have seen the new NBuffers value */
+	if (!pending_pm_shmem_resize)
+		return false;
+
+	/*
+	 * The first thing to do after attaching to the barrier is to wait for
+	 * others. We can't simply use BarrierArriveAndWait, because backends might
+	 * arrive here in disjoint groups, e.g. two backends, a pause, then two more
+	 * backends. If the resize is quick enough, that can lead to a situation
+	 * where the first group has already finished before the second has appeared,
+	 * and the barrier will only synchronize within those groups.
+	 */
+	if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
+		WaitForProcSignalBarrierReceived(
+				pg_atomic_read_u64(&ShmemCtrl->Generation));
+
+	/*
+	 * Now start the procedure, and elect one backend to ping postmaster to do
+	 * the same.
+	 *
+	 * XXX: If we need to be able to abort resizing, this has to be done later,
+	 * after the SHMEM_RESIZE_DONE.
+	 */
+	if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+	{
+		Assert(IsUnderPostmaster);
+		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
+	}
+
+	AnonymousShmemResize();
+
+	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
+	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
+
+	BarrierDetach(barrier);
+	return true;
+}
+
+/*
+ * GUC assign hook for shared_buffers. It's recommended for an assign hook to
+ * be as minimal as possible, thus we just request shared memory resize and
+ * remember the previous value.
+ */
+void
+assign_shared_buffers(int newval, void *extra, bool *pending)
+{
+	elog(DEBUG1, "Received SIGHUP for shmem resizing");
+
+	/* Request shared memory resize only when it was initialized */
+	if (next_free_segment != 0)
+	{
+		pending_pm_shmem_resize = true;
+		*pending = true;
+		NBuffersPending = newval;
+	}
+
+	NBuffersOld = NBuffers;
+}
+
+/*
+ * Test whether we have somehow missed a shmem resize signal and the NBuffers
+ * value differs from NSharedBuffers. If so, catch up and do the resize.
+ */
+void
+AdjustShmemSize(void)
+{
+	uint32 NSharedBuffers = pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);
+
+	if (NSharedBuffers != NBuffers)
+	{
+		/*
+		 * If the broadcasted shared_buffers is different from the one we see,
+		 * it could be that the backend has missed a resize signal. To avoid
+		 * any inconsistency, adjust the shared mappings before the backend
+		 * has a chance to access the buffer pool.
+		 */
+		ereport(LOG,
+				(errmsg("shared_buffers has been changed from %d to %d, "
+						"resize shared memory",
+						NBuffers, NSharedBuffers)));
+		NBuffers = NSharedBuffers;
+		AnonymousShmemResize();
+	}
+}
+
+/*
+ * Start resizing procedure, making sure all existing processes will have
+ * consistent view of shared memory size. Must be called only in postmaster.
+ */
+void
+CoordinateShmemResize(void)
+{
+	elog(DEBUG1, "Coordinating shmem resize from %d to %d",
+		 NBuffersOld, NBuffers);
+	Assert(!IsUnderPostmaster);
+
+	/*
+	 * We use dynamic barrier to help dealing with backends that were spawned
+	 * during the resize.
+	 */
+	BarrierInit(&ShmemCtrl->Barrier, 0);
+
+	/*
+	 * If the value did not change, or shared memory segments are not
+	 * initialized yet, skip the resize.
+	 */
+	if (NBuffersPending == NBuffersOld || next_free_segment == 0)
+	{
+		elog(DEBUG1, "Skip resizing, new %d, old %d, free segment %d",
+			 NBuffers, NBuffersOld, next_free_segment);
+		return;
+	}
+
+	/*
+	 * Shared memory resize requires some coordination done by the postmaster,
+	 * and consists of three phases:
+	 *
+	 * - Before the resize, all existing backends have the same old NBuffers.
+	 * - While the resize is in progress, backends are expected to have a
+	 *   mixture of old and new values. They're not allowed to touch the buffer
+	 *   pool during this time frame.
+	 * - After the resize has finished, all existing backends that can access
+	 *   the buffer pool are expected to have the same new value of NBuffers.
+	 *
+	 * Those phases are ensured by joining the shared barrier associated with
+	 * the procedure. Since resizing takes time, we need to take into account
+	 * that during that time:
+	 *
+	 * - New backends can be spawned. They will check the status of the barrier
+	 *   early during bootstrap, and wait until everything is over before
+	 *   working with the new NBuffers value.
+	 *
+	 * - Old backends can exit before attempting to resize. The synchronization
+	 *   used between backends relies on ProcSignalBarrier and waits at the
+	 *   beginning for all participants to receive the message, so that all
+	 *   existing backends are gathered.
+	 *
+	 * - Some backends might be blocked and not responding, either before or
+	 *   after receiving the message. In the first case such a backend still
+	 *   has a ProcSignalSlot and should be waited for; in the second case the
+	 *   shared barrier will make sure we still wait for those backends. In
+	 *   either case the wait is unbounded.
+	 *
+	 * - Backends might join the barrier in disjoint groups with some time in
+	 *   between. That means that relying only on the shared dynamic barrier is
+	 *   not enough -- it will only synchronize the resize procedure within
+	 *   those groups. That's why we first wait for all participants of the
+	 *   ProcSignal mechanism who received the message.
+	 */
+	elog(DEBUG1, "Emit a barrier for shmem resizing");
+	pg_atomic_init_u64(&ShmemCtrl->Generation,
+					   EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE));
+
+	/* To order everything after setting Generation value */
+	pg_memory_barrier();
+
+	/*
+	 * After that postmaster waits for PMSIGNAL_SHMEM_RESIZE as a sign that all
+	 * the rest of the pack has started the procedure and it can resize shared
+	 * memory as well.
+	 *
+	 * Normally we would call WaitForProcSignalBarrier here to wait until every
+	 * backend has reported on the ProcSignalBarrier. But for shared memory
+	 * resize we don't need this, as every participating backend will
+	 * synchronize on the ProcSignal barrier. In fact, even if we wanted to
+	 * wait here, it wouldn't be possible -- we're in the postmaster, without
+	 * any waiting infrastructure available.
+	 *
+	 * If at some point it turns out that waiting is essential, we would
+	 * need to consider some alternatives. E.g. it could be a designated
+	 * coordination process, other than the postmaster. Another option would be
+	 * to introduce a CoordinateShmemResize lock and allow only one process to
+	 * take it (this would probably have to be something different than
+	 * LWLocks, since they block interrupts, and coordination relies on them).
+	 */
+}
+
 /*
  * PGSharedMemoryCreate
  *
@@ -1239,3 +1638,50 @@ PGSharedMemoryDetach(void)
 		}
 	}
 }
+
+void
+WaitOnShmemBarrier(void)
+{
+	Barrier *barrier = &ShmemCtrl->Barrier;
+
+	/* Nothing to do if resizing is not started */
+	if (BarrierPhase(barrier) < SHMEM_RESIZE_START)
+		return;
+
+	BarrierAttach(barrier);
+
+	/* Otherwise wait through all available phases */
+	while (BarrierPhase(barrier) < SHMEM_RESIZE_DONE)
+	{
+		ereport(LOG, (errmsg("ProcSignal barrier is in phase %d, waiting",
+							 BarrierPhase(barrier))));
+
+		BarrierArriveAndWait(barrier, 0);
+	}
+
+	BarrierDetach(barrier);
+}
+
+void
+ShmemControlInit(void)
+{
+	bool foundShmemCtrl;
+
+	ShmemCtrl = (ShmemControl *)
+		ShmemInitStruct("Shmem Control", sizeof(ShmemControl),
+						&foundShmemCtrl);
+
+	if (!foundShmemCtrl)
+	{
+		/*
+		 * The barrier is missing here, it will be initialized right before
+		 * starting the resizing process as a convenient way to reset it.
+		 */
+
+		/* Initialize with the currently known value */
+		pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+
+		/* shmem_resizable should be initialized by now */
+		ShmemCtrl->Resizable = shmem_resizable;
+	}
+}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fda91ffd1ce..ab08e1a182b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -638,9 +638,12 @@ CheckpointerMain(const void *startup_data, size_t startup_data_len)
 static void
 ProcessCheckpointerInterrupts(void)
 {
-	if (ProcSignalBarrierPending)
-		ProcessProcSignalBarrier();
-
+	/*
+	 * Reloading the config can trigger further signals, complicating interrupt
+	 * processing -- so let it run first.
+	 *
+	 * XXX: Is there any need for a memory barrier after ProcessConfigFile?
+	 */
 	if (ConfigReloadPending)
 	{
 		ConfigReloadPending = false;
@@ -660,6 +663,9 @@ ProcessCheckpointerInterrupts(void)
 		UpdateSharedMemoryConfig();
 	}
 
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
 		ProcessLogMemoryContextInterrupt();
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 490f7ce3664..f0cb0098dcd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -426,6 +426,7 @@ static void process_pm_pmsignal(void);
 static void process_pm_child_exit(void);
 static void process_pm_reload_request(void);
 static void process_pm_shutdown_request(void);
+static void process_pm_shmem_resize(void);
 static void dummy_handler(SIGNAL_ARGS);
 static void CleanupBackend(PMChild *bp, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
@@ -1694,6 +1695,9 @@ ServerLoop(void)
 			if (pending_pm_pmsignal)
 				process_pm_pmsignal();
 
+			if (pending_pm_shmem_resize)
+				process_pm_shmem_resize();
+
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
 				ClientSocket s;
@@ -2039,6 +2043,17 @@ process_pm_reload_request(void)
 	}
 }
 
+static void
+process_pm_shmem_resize(void)
+{
+	/*
+	 * Failure to resize is considered to be fatal and will not be
+	 * retried, which means we can clear the pending flag right here.
+	 */
+	pending_pm_shmem_resize = false;
+	CoordinateShmemResize();
+}
+
 /*
  * pg_ctl uses SIGTERM, SIGINT and SIGQUIT to request different types of
  * shutdown.
@@ -3852,6 +3867,9 @@ process_pm_pmsignal(void)
 		request_state_update = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_SHMEM_RESIZE))
+		AnonymousShmemResize();
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
 	 */
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index bd68b69ee98..8c1ea623392 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -17,6 +17,7 @@
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -62,18 +63,28 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend). Size of data structures initialized
- * here depends on NBuffers, and to be able to change NBuffers without a
- * restart we store each structure into a separate shared memory segment, which
- * could be resized on demand.
+ * postmaster, or in a standalone backend) or during a shared-memory resize. The
+ * size of the data structures initialized here depends on NBuffers, and to be
+ * able to change NBuffers without a restart we store each structure in a
+ * separate shared memory segment, which can be resized on demand.
+ *
+ * FirstBufferToInit tells where to start initializing buffers. For the
+ * initial setup it will always be zero, but when resizing shared memory it
+ * indicates the number of already initialized buffers.
+ *
+ * No locks are taken in this function; it is the caller's responsibility to
+ * make sure only one backend can work with the new buffers.
  */
 void
-BufferManagerShmemInit(void)
+BufferManagerShmemInit(int FirstBufferToInit)
 {
 	bool		foundBufs,
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
+	int			i;
+	elog(DEBUG1, "BufferManagerShmemInit from %d to %d",
+				 FirstBufferToInit, NBuffers);
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
@@ -110,43 +121,44 @@ BufferManagerShmemInit(void)
 	{
 		/* should find all of these, or none of them */
 		Assert(foundDescs && foundBufs && foundIOCV && foundBufCkpt);
-		/* note: this path is only taken in EXEC_BACKEND case */
-	}
-	else
-	{
-		int			i;
-
 		/*
-		 * Initialize all the buffer headers.
+		 * note: this path is only taken in EXEC_BACKEND case when initializing
+		 * shared memory, or in all cases when resizing shared memory.
 		 */
-		for (i = 0; i < NBuffers; i++)
-		{
-			BufferDesc *buf = GetBufferDescriptor(i);
+	}
 
-			ClearBufferTag(&buf->tag);
+#ifndef EXEC_BACKEND
+	/*
+	 * Initialize all the buffer headers.
+	 */
+	for (i = FirstBufferToInit; i < NBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
 
-			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+		ClearBufferTag(&buf->tag);
 
-			buf->buf_id = i;
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
 
-			pgaio_wref_clear(&buf->io_wref);
+		buf->buf_id = i;
 
-			/*
-			 * Initially link all the buffers together as unused. Subsequent
-			 * management of this list is done by freelist.c.
-			 */
-			buf->freeNext = i + 1;
+		pgaio_wref_clear(&buf->io_wref);
 
-			LWLockInitialize(BufferDescriptorGetContentLock(buf),
-							 LWTRANCHE_BUFFER_CONTENT);
+		/*
+		 * Initially link all the buffers together as unused. Subsequent
+		 * management of this list is done by freelist.c.
+		 */
+		buf->freeNext = i + 1;
 
-			ConditionVariableInit(BufferDescriptorGetIOCV(buf));
-		}
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
 
-		/* Correct last entry of linked list */
-		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
 	}
+#endif
+
+	/* Correct last entry of linked list */
+	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2dbd81afc87..c5725f55120 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -84,6 +84,9 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ *
+ * XXX: Calculation for non main shared memory segments are incorrect, it
+ * includes more than needed for buffers only.
  */
 Size
 CalculateShmemSize(int *num_semaphores, int shmem_segment)
@@ -151,6 +154,14 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how slots are split, or was there a hidden
+	 * inconsistency in the shmem calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
@@ -298,7 +309,7 @@ CreateOrAttachShmemStructs(void)
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
-	BufferManagerShmemInit();
+	BufferManagerShmemInit(0);
 
 	/*
 	 * Set up lock manager
@@ -310,6 +321,11 @@ CreateOrAttachShmemStructs(void)
 	 */
 	PredicateLockShmemInit();
 
+	/*
+	 * Set up shared memory resize manager
+	 */
+	ShmemControlInit();
+
 	/*
 	 * Set up process table
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6bec9be423..d7b56a18b24 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/pg_shmem.h"
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
@@ -113,6 +114,10 @@ static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
 
+#ifdef DEBUG_SHMEM_RESIZE
+bool delay_proc_signal_init = false;
+#endif
+
 /*
  * ProcSignalShmemSize
  *		Compute space needed for ProcSignal's shared memory
@@ -176,6 +181,43 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	uint32		old_pss_pid;
 
 	Assert(cancel_key_len >= 0 && cancel_key_len <= MAX_CANCEL_KEY_LENGTH);
+
+#ifdef DEBUG_SHMEM_RESIZE
+	/*
+	 * Introduced for debugging purposes. You can change the variable at
+	 * runtime using gdb, then start new backends with delayed ProcSignal
+	 * initialization. A simple pg_usleep won't work here due to the SIGHUP
+	 * interrupt needed for testing. Taken from pg_sleep.
+	 */
+	if (delay_proc_signal_init)
+	{
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+		float8		endtime = GetNowFloat() + 5;
+
+		for (;;)
+		{
+			float8		delay;
+			long		delay_ms;
+
+			CHECK_FOR_INTERRUPTS();
+
+			delay = endtime - GetNowFloat();
+			if (delay >= 600.0)
+				delay_ms = 600000;
+			else if (delay > 0.0)
+				delay_ms = (long) (delay * 1000.0);
+			else
+				break;
+
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 delay_ms,
+							 WAIT_EVENT_PG_SLEEP);
+			ResetLatch(MyLatch);
+		}
+	}
+#endif
+
 	if (MyProcNumber < 0)
 		elog(ERROR, "MyProcNumber not set");
 	if (MyProcNumber >= NumProcSignalSlots)
@@ -615,6 +657,10 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
+					case PROCSIGNAL_BARRIER_SHMEM_RESIZE:
+						processed = ProcessBarrierShmemResize(
+								&ShmemCtrl->Barrier);
+						break;
 				}
 
 				/*
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 8d025f0e907..9fa277b91d7 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -498,17 +498,26 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. Verify the structure's size:
+		 * - If it's the same, we've found the expected structure.
+		 * - If it's different, we're resizing the expected structure.
+		 *
+		 * XXX: There is an implicit assumption this can only happen in
+		 * "resizable" segments, where only one shared structure is allowed.
+		 * This has to be implemented more cleanly.
 		 */
 		if (result->size != size)
 		{
-			LWLockRelease(ShmemIndexLock);
-			ereport(ERROR,
-					(errmsg("ShmemIndex entry size is wrong for data structure"
-							" \"%s\": expected %zu, actual %zu",
-							name, size, result->size)));
+			Size delta = size - result->size;
+
+			result->size = size;
+
+			/* Reflect size change in the shared segment */
+			SpinLockAcquire(Segments[shmem_segment].ShmemLock);
+			Segments[shmem_segment].ShmemSegHdr->freeoffset += delta;
+			SpinLockRelease(Segments[shmem_segment].ShmemLock);
 		}
+
 		structPtr = result->location;
 	}
 	else
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0d1b6466d1e..0942d2bffe2 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -62,6 +62,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -4309,6 +4310,15 @@ PostgresMain(const char *dbname, const char *username)
 	 */
 	BeginReportingGUCOptions();
 
+	/* Verify the shared barrier, if it's still active: join and wait. */
+	WaitOnShmemBarrier();
+
+	/*
+	 * After waiting on the barrier above we guaranteed to have NSharedBuffers
+	 * broadcasted, so we can use it in the function below.
+	 */
+	AdjustShmemSize();
+
 	/*
 	 * Also set up handler to log session end; we have to wait till now to be
 	 * sure Log_disconnections has its final value.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4da68312b5f..691fa14e9e3 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -155,6 +155,8 @@ REPLICATION_ORIGIN_DROP	"Waiting for a replication origin to become inactive so
 REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
+SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
@@ -352,6 +354,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a221e446d6a..d8acec0f911 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2366,14 +2366,14 @@ struct config_int ConfigureNamesInt[] =
 	 * checking for overflow, so we mustn't allow more than INT_MAX / 2.
 	 */
 	{
-		{"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+		{"shared_buffers", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the number of shared memory buffers used by the server."),
 			NULL,
 			GUC_UNIT_BLOCKS
 		},
 		&NBuffers,
 		16384, 16, INT_MAX / 2,
-		NULL, NULL, NULL
+		NULL, assign_shared_buffers, NULL
 	},
 
 	{
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index edac9db6a12..4239ebe640b 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -317,7 +317,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_skipped);
 
 /* in buf_init.c */
-extern void BufferManagerShmemInit(void);
+extern void BufferManagerShmemInit(int);
 extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 6ebda479ced..bb7ae4d33b3 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -64,6 +64,7 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
+extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
@@ -83,5 +84,7 @@ extern void CreateSharedMemoryAndSemaphores(void);
 extern void AttachSharedMemoryStructs(void);
 #endif
 extern void InitializeShmemGUCs(void);
+extern void CoordinateShmemResize(void);
+extern bool AnonymousShmemResize(void);
 
 #endif							/* IPC_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..558da6fdd55 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, ShmemResize)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index a7b275b4db9..bccdd45b1f7 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,6 +24,7 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "storage/barrier.h"
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
@@ -56,6 +57,25 @@ typedef struct ShmemSegment
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
+/*
+ * ShmemControl is shared between backends and helps to coordinate shared
+ * memory resize.
+ */
+typedef struct
+{
+	pg_atomic_uint32 	NSharedBuffers;
+	Barrier 			Barrier;
+	pg_atomic_uint64 	Generation;
+	bool                Resizable;
+} ShmemControl;
+
+extern PGDLLIMPORT ShmemControl *ShmemCtrl;
+
+/* The phases for shared memory resizing, used by for ProcSignal barrier. */
+#define SHMEM_RESIZE_REQUESTED			0
+#define SHMEM_RESIZE_START				1
+#define SHMEM_RESIZE_DONE				2
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -109,6 +129,12 @@ extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 							int *memfd_flags);
 void PrepareHugePages(void);
 
+bool ProcessBarrierShmemResize(Barrier *barrier);
+void assign_shared_buffers(int newval, void *extra, bool *pending);
+void AdjustShmemSize(void);
+extern void WaitOnShmemBarrier(void);
+extern void ShmemControlInit(void);
+
 /*
  * To be able to dynamically resize largest parts of the data stored in shared
  * memory, we split it into multiple shared memory mappings segments. Each
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 428aa3fd68a..1a55bf57a70 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -42,6 +42,7 @@ typedef enum
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
 	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
+	PMSIGNAL_SHMEM_RESIZE,	/* resize shared memory */
 } PMSignalReason;
 
 #define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 2733bbb8c5b..97033f84dce 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,6 +54,7 @@ typedef enum
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
+	PROCSIGNAL_BARRIER_SHMEM_RESIZE, /* ask backends to resize shared memory */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 32d6e718adc..36e25ea8cd6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2751,6 +2751,7 @@ ShellTypeInfo
 ShippableCacheEntry
 ShippableCacheKey
 ShmemIndexEnt
+ShmemControl
 ShutdownForeignScan_function
 ShutdownInformation
 ShutdownMode
-- 
2.49.0

v5-0008-Support-shrinking-shared-buffers.patch (text/plain; charset=us-ascii)
From 5e190dc84ce7078b6c680d18736e56be90cecdab Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 19 Jun 2025 17:38:29 +0200
Subject: [PATCH v5 08/10] Support shrinking shared buffers

When shrinking the shared buffers pool, each buffer in the area being
shrunk needs to be flushed if it's dirty so as not to lose the changes
to that buffer after shrinking. Also, each such buffer needs to be
removed from the buffer mapping table so that backends do not access it
after shrinking.

Buffer eviction requires a separate barrier phase for two reasons:

1. No other backend should map a new page to any of the buffers being
   evicted while eviction is in progress, so they wait for the eviction
   to finish.

2. Since a pinned buffer has the pin recorded in the backend local
   memory as well as the buffer descriptor (which is in shared memory),
   eviction should not coincide with remapping the shared memory of a
   backend. Otherwise we might lose consistency of the local and shared
   pinning records. Hence it needs to be carried out in
   ProcessBarrierShmemResize() and not in AnonymousShmemResize(), as
   indicated by the now-removed comment.

If a buffer being evicted is pinned, we raise a FATAL error, but this should
be improved. There are multiple options: 1. wait for the pinned buffer to get
unpinned, 2. kill the backend holding the pin or have it cancel its own query,
or 3. roll back the operation. Note that options 1 and 2 would require the
pinning-related local and shared records to be accessed, but we lack the
infrastructure to do any of this right now.

Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 42 +++++---
 src/backend/storage/buffer/buf_init.c         |  8 +-
 src/backend/storage/buffer/bufmgr.c           | 95 +++++++++++++++++++
 src/backend/storage/buffer/freelist.c         | 68 +++++++++++++
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/storage/buf_internals.h           |  1 +
 src/include/storage/bufmgr.h                  |  1 +
 src/include/storage/pg_shmem.h                |  1 +
 8 files changed, 202 insertions(+), 15 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index b3c90d15d52..e612a83c9f0 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1011,14 +1011,6 @@ AnonymousShmemResize(void)
 	 */
 	pending_pm_shmem_resize = false;
 
-	/*
-	 * XXX: Currently only increasing of shared_buffers is supported. For
-	 * decreasing something similar has to be done, but buffer blocks with
-	 * data have to be drained first.
-	 */
-	if(NBuffersOld > NBuffers)
-		return false;
-
 #ifndef MAP_HUGETLB
 	/* PrepareHugePages should have dealt with this case */
 	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
@@ -1112,11 +1104,14 @@ AnonymousShmemResize(void)
 				 * all the pointers are still valid, and we only need to update
 				 * structures size in the ShmemIndex once -- any other backend
 				 * will pick up this shared structure from the index.
-				 *
-				 * XXX: This is the right place for buffer eviction as well.
 				 */
 				BufferManagerShmemInit(NBuffersOld);
 
+				/*
+				 * Wipe out the evictor PID so that it can be used for the next
+				 * buffer resizing operation.
+				 */
+				ShmemCtrl->evictor_pid = 0;
 				/* If all fine, broadcast the new value */
 				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
 			}
@@ -1169,11 +1164,31 @@ ProcessBarrierShmemResize(Barrier *barrier)
 	 * XXX: If we need to be able to abort resizing, this has to be done later,
 	 * after the SHMEM_RESIZE_DONE.
 	 */
-	if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+
+	/*
+	 * Evict extra buffers when shrinking shared buffers. We need to do this
+	 * while the memory for extra buffers is still mapped i.e. before remapping
+	 * the shared memory segments to a smaller memory area.
+	 */
+	if (NBuffersOld > NBuffersPending)
 	{
-		Assert(IsUnderPostmaster);
-		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
+		BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);
+
+		/*
+		 * TODO: If the buffer eviction fails for any reason, we should
+		 * gracefully rollback the shared buffer resizing and try again. But the
+		 * infrastructure to do so is not available right now. Hence just raise
+		 * a FATAL so that the system restarts.
+		 */
+		if (!EvictExtraBuffers(NBuffersPending, NBuffersOld))
+			elog(FATAL, "buffer eviction failed");
+
+		if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_EVICT))
+			SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
 	}
+	else
+		if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+			SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
 
 	AnonymousShmemResize();
 
@@ -1683,5 +1698,6 @@ ShmemControlInit(void)
 
 		/* shmem_resizable should be initialized by now */
 		ShmemCtrl->Resizable = shmem_resizable;
+		ShmemCtrl->evictor_pid = 0;
 	}
 }
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 8c1ea623392..6f5743036a2 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -157,8 +157,12 @@ BufferManagerShmemInit(int FirstBufferToInit)
 	}
 #endif
 
-	/* Correct last entry of linked list */
-	GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+	/*
+	 * Correct last entry of linked list, when initializing the buffers or when
+	 * expanding the buffers.
+	 */
+	if (FirstBufferToInit < NBuffers)
+		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 667aa0c0c78..169a44dd9fc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/read_stream.h"
 #include "storage/smgr.h"
@@ -7453,3 +7454,97 @@ const PgAioHandleCallbacks aio_local_buffer_readv_cb = {
 	.complete_local = local_buffer_readv_complete,
 	.report = buffer_readv_report,
 };
+
+/*
+ * When shrinking shared buffers pool, evict the buffers which will not be part
+ * of the shrunk buffer pool.
+ */
+bool
+EvictExtraBuffers(int newBufSize, int oldBufSize)
+{
+	bool result = true;
+
+	/*
+	 * If the buffer being evicted is locked, this function will need to wait,
+	 * so it must not be called from the postmaster, which cannot wait on a lock.
+	 */
+	Assert(IsUnderPostmaster);
+
+	/*
+	 * Let only one backend perform eviction. We could split the work across all
+	 * the backends but that doesn't seem necessary.
+	 *
+	 * The first backend to acquire ShmemResizeLock sets its own PID as the
+	 * evictor PID for other backends to know that the eviction is in progress or
+	 * has already been performed. The evictor backend releases the lock when it
+	 * finishes eviction.  While the eviction is in progress, backends other than
+	 * evictor backend won't be able to take the lock. They won't perform
+	 * eviction. A backend may acquire the lock after eviction has completed, but
+	 * it will not perform eviction since the evictor PID is already set. Evictor
+	 * PID is reset only when the buffer resizing finishes. Thus only one backend
+	 * will perform eviction in a given instance of shared buffers resizing.
+	 *
+	 * Any backend which acquires this lock will release it before the eviction
+	 * phase finishes, hence the same lock can be reused for the next phase of
+	 * resizing buffers.
+	 */
+	if (LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+	{
+		if (ShmemCtrl->evictor_pid == 0)
+		{
+			ShmemCtrl->evictor_pid = MyProcPid;
+
+			StrategyPurgeFreeList(newBufSize);
+
+			/*
+			 * TODO: Before evicting any buffer, we should check whether any of the
+			 * buffers are pinned. If we find that a buffer is pinned after evicting
+			 * most of them, that will impact performance since all those evicted
+			 * buffers might need to be read again.
+			 */
+			for (Buffer buf = newBufSize + 1; buf <= oldBufSize; buf++)
+			{
+				BufferDesc *desc = GetBufferDescriptor(buf - 1);
+				uint32		buf_state;
+				bool		buffer_flushed;
+
+				buf_state = pg_atomic_read_u32(&desc->state);
+
+				/*
+				 * Nobody is expected to touch the buffers while resizing is
+				 * going on, hence an unlocked precheck should be safe and
+				 * saves some cycles.
+				 */
+				if (!(buf_state & BM_VALID))
+					continue;
+
+				/*
+				 * XXX: Looks like CurrentResourceOwner can be NULL here, find
+				 * another one in that case?
+				 */
+				if (CurrentResourceOwner)
+					ResourceOwnerEnlarge(CurrentResourceOwner);
+
+				ReservePrivateRefCountEntry();
+
+				LockBufHdr(desc);
+
+				/*
+				 * Now that we have locked buffer descriptor, make sure that the
+				 * buffer without valid data has been skipped above.
+				 */
+				Assert(buf_state & BM_VALID);
+
+				if (!EvictUnpinnedBufferInternal(desc, &buffer_flushed))
+				{
+					elog(WARNING, "could not remove buffer %u, it is pinned", buf);
+					result = false;
+					break;
+				}
+			}
+		}
+		LWLockRelease(ShmemResizeLock);
+	}
+
+	return result;
+}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index bd390f2709d..7b9ed010e2f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -527,6 +527,74 @@ StrategyInitialize(bool init)
 }
 
 
+/*
+ * StrategyPurgeFreeList -- remove from the free list all buffers whose id is
+ * greater than or equal to the new number of buffers in the buffer pool.
+ *
+ * This is called before evicting buffers while shrinking shared buffers, so that
+ * the free list does not reference a buffer that will be removed.
+ *
+ * The function is called after resizing has started, so nobody should be
+ * traversing the free list or touching the buffers at this point.
+ */
+void
+StrategyPurgeFreeList(int numBuffers)
+{
+	int	firstBuffer = FREENEXT_END_OF_LIST;
+	int	nextFree = StrategyControl->firstFreeBuffer;
+	BufferDesc *prevValidBuf = NULL;
+
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+	while (nextFree != FREENEXT_END_OF_LIST)
+	{
+		BufferDesc *buf = GetBufferDescriptor(nextFree);
+
+		/* nextFree should be id of buffer being examined. */
+		Assert(nextFree == buf->buf_id);
+		/* The buffer should not be marked as not in the list. */
+		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+		/*
+		 * If the buffer is within the new size of pool, keep it in the free list
+		 * otherwise discard it.
+		 */
+		if (buf->buf_id < numBuffers)
+		{
+			if (prevValidBuf != NULL)
+				prevValidBuf->freeNext = buf->buf_id;
+			prevValidBuf = buf;
+
+			/* Save the first free buffer in the list if not already known. */
+			if (firstBuffer == FREENEXT_END_OF_LIST)
+				firstBuffer = nextFree;
+		}
+		/* Examine the next buffer in the free list. */
+		nextFree = buf->freeNext;
+	}
+
+	/* Update the last valid free buffer, if there's any. */
+	if (prevValidBuf != NULL)
+	{
+		StrategyControl->lastFreeBuffer = prevValidBuf->buf_id;
+		prevValidBuf->freeNext = FREENEXT_END_OF_LIST;
+	}
+	else
+		StrategyControl->lastFreeBuffer = FREENEXT_END_OF_LIST;
+
+	/* Update first valid free buffer, if there's any. */
+	StrategyControl->firstFreeBuffer = firstBuffer;
+
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
+	/*
+	 * TODO: following was suggested by AI. Check whether it is required.
+	 * If we removed all buffers from the freelist, reset the clock sweep
+	 * pointer to zero.  This is not strictly necessary, but it seems like a
+	 * good idea to avoid confusion.
+	 */
+}
+
 /* ----------------------------------------------------------------
  *				Backend-private buffer ring management
  * ----------------------------------------------------------------
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 691fa14e9e3..0c588b69a90 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -156,6 +156,7 @@ REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it c
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
 SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_EVICT	"Waiting for other backends to finish the buffer eviction phase."
 SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0dec7d93b3b..add15e3723b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -453,6 +453,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
+extern void StrategyPurgeFreeList(int numBuffers);
 extern bool have_free_buffer(void);
 
 /* buf_table.c */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4239ebe640b..0c554f0b130 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -315,6 +315,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_evicted,
 									int32 *buffers_flushed,
 									int32 *buffers_skipped);
+extern bool EvictExtraBuffers(int fromBuf, int toBuf);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(int);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index bccdd45b1f7..9a3e68f76d8 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -64,6 +64,7 @@ extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 typedef struct
 {
 	pg_atomic_uint32 	NSharedBuffers;
+	pid_t				evictor_pid;
 	Barrier 			Barrier;
 	pg_atomic_uint64 	Generation;
 	bool                Resizable;
-- 
2.49.0

v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch (text/plain; charset=us-ascii)
From 475fffc314b4d5975faf94d43504f2dcc0f1dc8d Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 19 Jun 2025 17:38:51 +0200
Subject: [PATCH v5 09/10] Reinitialize StrategyControl after resizing buffers

... and BgBufferSync and ClockSweepTick adjustments

The commit introduces a separate function StrategyReInitialize() instead
of reusing StrategyInitialize() since some of the things that the second
one does are not required in the first one. Here's list of what
StrategyReInitialize() does and how does it differ from
StrategyInitialize().

1. When expanding the buffer pool add new buffers to the free list.
2. When shrinking buffers, we remove any buffers, in the area being
   shrunk, from the freelist. While doing so we adjust the first and
   last free buffer pointers in the StrategyControl area. Hence nothing
   more needed after resizing.
3. Check the sanity of the free buffer list is added after resizing.
4. StrategyControl pointer needn't be fetched again since it should not
   change. But added an Assert to make sure the pointer is valid.
5. &StrategyControl->buffer_strategy_lock need not be initialized again.
6. nextVictimBuffer, completePasses and numBufferAllocs are viewed in
   the context of NBuffers. Now that NBuffers itself has changed, those
   three do not make sense. Reset them as if the server has restarted
   again.

This commit introduces a flag delay_shmem_resize, which PostgreSQL
backends and workers can use to signal the coordinator to delay the
resizing operation. The background writer sets this flag while it is
scanning buffers. The background writer is blocked while the actual
resizing is in progress. If resizing is about to begin, it skips the
buffer scan by returning from BgBufferSync() early, and it stops a scan
already in progress when it sees that the resizing has begun. After the
resizing is finished, it adjusts the collected statistics according to
the new size of the buffer pool at the end of barrier processing.

Once the buffer resizing is finished, before resuming regular
operation, the bgwriter resets the information saved so far. This
information is viewed in the context of NBuffers and hence does not make
sense after NBuffers has changed.

Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c         |  24 ++++-
 src/backend/postmaster/bgwriter.c     |   2 +-
 src/backend/storage/buffer/buf_init.c |  11 ++-
 src/backend/storage/buffer/bufmgr.c   |  74 ++++++++++----
 src/backend/storage/buffer/freelist.c | 133 ++++++++++++++++++++++++++
 src/include/storage/buf_internals.h   |   1 +
 src/include/storage/bufmgr.h          |   3 +-
 src/include/storage/ipc.h             |   1 +
 8 files changed, 226 insertions(+), 23 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index e612a83c9f0..e61a557966b 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -115,6 +115,7 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 
 /* Flag telling postmaster that resize is needed */
 volatile bool pending_pm_shmem_resize = false;
+volatile bool delay_shmem_resize = false;
 
 /* Keeps track of the previous NBuffers value */
 static int NBuffersOld = -1;
@@ -1118,6 +1119,12 @@ AnonymousShmemResize(void)
 
 			LWLockRelease(ShmemResizeLock);
 		}
+
+		/*
+		 * TODO: Shouldn't we call ResizeBufferPool() here as well? Or those
+		 * backend who can not lock the LWLock conditionally won't resize the
+		 * buffers.
+		 */
 	}
 
 	return true;
@@ -1138,13 +1145,17 @@ ProcessBarrierShmemResize(Barrier *barrier)
 {
 	Assert(IsUnderPostmaster);
 
-	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
-		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize, delay_shmem_resize);
 
 	/* Wait until we have seen the new NBuffers value */
 	if (!pending_pm_shmem_resize)
 		return false;
 
+	/* Wait till this process becomes ready to resize buffers. */
+	if (delay_shmem_resize)
+		return false;
+
 	/*
 	 * First thing to do after attaching to the barrier is to wait for others.
 	 * We can't simply use BarrierArriveAndWait, because backends might arrive
@@ -1195,6 +1206,15 @@ ProcessBarrierShmemResize(Barrier *barrier)
 	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
 	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
 
+	if (MyBackendType == B_BG_WRITER)
+	{
+		/*
+		 * Before resuming regular background writer activity, adjust the
+		 * statistics collected so far.
+		 */
+		BgBufferSyncAdjust(NBuffersOld, NBuffers);
+	}
+
 	BarrierDetach(barrier);
 	return true;
 }
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 72f5acceec7..32b34f28ead 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -233,7 +233,7 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
 		/*
 		 * Do one cycle of dirty-buffer writing.
 		 */
-		can_hibernate = BgBufferSync(&wb_context);
+		can_hibernate = BgBufferSync(&wb_context, false);
 
 		/* Report pending statistics to the cumulative stats system */
 		pgstat_report_bgwriter();
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6f5743036a2..62c1672fb70 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -164,8 +164,15 @@ BufferManagerShmemInit(int FirstBufferToInit)
 	if (FirstBufferToInit < NBuffers)
 		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 
-	/* Init other shared buffer-management stuff */
-	StrategyInitialize(!foundDescs);
+	/*
+	 * Init other shared buffer-management stuff from scratch configuring buffer
+	 * pool the first time. If we are just resizing buffer pool adjust only the
+	 * required structures.
+	 */
+	if (FirstBufferToInit == 0)
+		StrategyInitialize(!foundDescs);
+	else
+		StrategyReInitialize(FirstBufferToInit);
 
 	/* Initialize per-backend file flush context */
 	WritebackContextInit(&BackendWritebackContext,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 169a44dd9fc..590fc737da8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3611,6 +3611,32 @@ BufferSync(int flags)
 	TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
 }
 
+/*
+ * Information saved between BgBufferSync() calls so we can determine the
+ * strategy point's advance rate and avoid scanning already-cleaned buffers. The
+ * variables are global instead of static local so that BgBufferSyncAdjust() can
+ * adjust it when resizing shared buffers.
+ */
+static bool saved_info_valid = false;
+static int	prev_strategy_buf_id;
+static uint32 prev_strategy_passes;
+static int	next_to_clean;
+static uint32 next_passes;
+
+/* Moving averages of allocation rate and clean-buffer density */
+static float smoothed_alloc = 0;
+static float smoothed_density = 10.0;
+
+void
+BgBufferSyncAdjust(int NBuffersOld, int NBuffersNew)
+{
+	saved_info_valid = false;
+#ifdef BGW_DEBUG
+	elog(DEBUG2, "invalidated background writer status after resizing buffers from %d to %d",
+		 NBuffersOld, NBuffersNew);
+#endif
+}
+
 /*
  * BgBufferSync -- Write out some dirty buffers in the pool.
  *
@@ -3623,27 +3649,13 @@ BufferSync(int flags)
  * bgwriter_lru_maxpages to 0.)
  */
 bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(WritebackContext *wb_context, bool reset)
 {
 	/* info obtained from freelist.c */
 	int			strategy_buf_id;
 	uint32		strategy_passes;
 	uint32		recent_alloc;
 
-	/*
-	 * Information saved between calls so we can determine the strategy
-	 * point's advance rate and avoid scanning already-cleaned buffers.
-	 */
-	static bool saved_info_valid = false;
-	static int	prev_strategy_buf_id;
-	static uint32 prev_strategy_passes;
-	static int	next_to_clean;
-	static uint32 next_passes;
-
-	/* Moving averages of allocation rate and clean-buffer density */
-	static float smoothed_alloc = 0;
-	static float smoothed_density = 10.0;
-
 	/* Potentially these could be tunables, but for now, not */
 	float		smoothing_samples = 16;
 	float		scan_whole_pool_milliseconds = 120000.0;
@@ -3666,6 +3678,22 @@ BgBufferSync(WritebackContext *wb_context)
 	long		new_strategy_delta;
 	uint32		new_recent_alloc;
 
+	/*
+	 * If the buffer pool is being shrunk, the buffer being written out may not
+	 * remain valid. If the buffer pool is being expanded, more buffers will
+	 * become available without this function writing out any. Hence wait until
+	 * buffer resizing finishes, i.e. go into hibernation mode.
+	 */
+	if (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+		return true;
+
+	/*
+	 * Resizing shared buffers while this function is performing an LRU scan on
+	 * them may lead to wrong results. Indicate that the resizing should wait for
+	 * the LRU scan to complete.
+	 */
+	delay_shmem_resize = true;
+
 	/*
 	 * Find out where the freelist clock sweep currently is, and how many
 	 * buffer allocations have happened since our last call.
@@ -3842,8 +3870,17 @@ BgBufferSync(WritebackContext *wb_context)
 	num_written = 0;
 	reusable_buffers = reusable_buffers_est;
 
-	/* Execute the LRU scan */
-	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
+	/*
+	 * Execute the LRU scan.
+	 *
+	 * If the buffer pool is being shrunk, the buffer being written may not
+	 * remain valid. If it is being expanded, more buffers will become available
+	 * without this function writing any. Hence stop what we are doing. This
+	 * also unblocks other processes that are waiting for buffer resizing to
+	 * finish.
+	 */
+	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est &&
+			pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) == NBuffers)
 	{
 		int			sync_state = SyncOneBuffer(next_to_clean, true,
 											   wb_context);
@@ -3902,6 +3939,9 @@ BgBufferSync(WritebackContext *wb_context)
 #endif
 	}
 
+	/* Let the resizing commence. */
+	delay_shmem_resize = false;
+
 	/* Return true if OK to hibernate */
 	return (bufs_to_lap == 0 && recent_alloc == 0);
 }
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7b9ed010e2f..41641bb3ae6 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -98,6 +98,9 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
 									 uint32 *buf_state);
 static void AddBufferToRing(BufferAccessStrategy strategy,
 							BufferDesc *buf);
+#ifdef USE_ASSERT_CHECKING
+static void StrategyValidateFreeList(void);
+#endif /* USE_ASSERT_CHECKING */
 
 /*
  * ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -526,6 +529,88 @@ StrategyInitialize(bool init)
 		Assert(!init);
 }
 
+/*
+ * StrategyReInitialize -- re-initialize the buffer cache replacement
+ *		strategy.
+ *
+ * To be called when resizing buffer manager and only from the coordinator.
+ * TODO: Assess the differences between this function and StrategyInitialize().
+ */
+void
+StrategyReInitialize(int FirstBufferIdToInit)
+{
+	bool		found;
+
+	/*
+	 * Resizing memory for buffer pools should not affect the address of
+	 * StrategyControl.
+	 */
+	if (StrategyControl != (BufferStrategyControl *)
+		ShmemInitStructInSegment("Buffer Strategy Status",
+						sizeof(BufferStrategyControl),
+						&found, STRATEGY_SHMEM_SEGMENT))
+		elog(FATAL, "something went wrong while re-initializing the buffer strategy");
+
+	Assert(found);
+
+	/* TODO: Buffer lookup table adjustment: There are two options:
+	 *
+	 * 1. Resize the buffer lookup table to match the new number of buffers. But
+	 * this requires rehashing all the entries in the buffer lookup table with
+	 * the new table size.
+	 *
+	 * 2. Allocate the maximum size of the buffer lookup table at the beginning
+	 * and never resize it. This leaves a sparse buffer lookup table, which is
+	 * inefficient from both a memory and a time perspective. According to David
+	 * Rowley, the sparse entries in the buffer lookup table cause frequent
+	 * cacheline reloads, which affect performance. If the impact of that
+	 * inefficiency in a benchmark is significant, we will need to consider the
+	 * first option.
+	 */
+
+	/*
+	 * When shrinking buffers, we must have adjusted the first and the last free
+	 * buffer when removing the buffers being shrunk from the free list. Nothing
+	 * to be done here.
+	 *
+	 * When expanding the shared buffers, new buffers are added at the end of the
+	 * freelist or they form the new free list if there are no free buffers.
+	 */
+	if (FirstBufferIdToInit < NBuffers)
+	{
+		if (StrategyControl->firstFreeBuffer == FREENEXT_END_OF_LIST)
+			StrategyControl->firstFreeBuffer = FirstBufferIdToInit;
+		else
+		{
+			Assert(StrategyControl->lastFreeBuffer >= 0);
+			GetBufferDescriptor(StrategyControl->lastFreeBuffer)->freeNext = FirstBufferIdToInit;
+		}
+
+		StrategyControl->lastFreeBuffer = NBuffers - 1;
+	}
+
+	/* Check free list sanity after resizing. */
+#ifdef USE_ASSERT_CHECKING
+	StrategyValidateFreeList();
+#endif /* USE_ASSERT_CHECKING */
+
+	/*
+	 * The clock sweep tick pointer might have got invalidated. Reset it as if
+	 * starting a fresh server.
+	 */
+	pg_atomic_write_u32(&StrategyControl->nextVictimBuffer, 0);
+
+	/*
+	 * The old statistics are viewed in the context of the number of shared
+	 * buffers. They do not make sense now that the number of shared buffers
+	 * itself has changed.
+	 */
+	StrategyControl->completePasses = 0;
+	pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
+
+	/* No pending notification */
+	StrategyControl->bgwprocno = -1;
+}
 
 /*
  * StrategyPurgeFreeList -- remove all buffers with id higher than the number of
@@ -595,6 +680,54 @@ StrategyPurgeFreeList(int numBuffers)
 	 */
 }
 
+#ifdef USE_ASSERT_CHECKING
+/*
+ * StrategyValidateFreeList-- check sanity of free buffer list.
+ */
+static void
+StrategyValidateFreeList(void)
+{
+	int			nextFree = StrategyControl->firstFreeBuffer;
+	int			numFreeBuffers = 0;
+	int			lastFreeBuffer = FREENEXT_END_OF_LIST;
+
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+	while (nextFree != FREENEXT_END_OF_LIST)
+	{
+		BufferDesc *buf = GetBufferDescriptor(nextFree);
+
+		/* nextFree should be id of buffer being examined. */
+		Assert(nextFree == buf->buf_id);
+		Assert(buf->buf_id < NBuffers);
+		/* The buffer should not be marked as not in the list. */
+		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+		/* Update our knowledge of last buffer in the free list. */
+		lastFreeBuffer = buf->buf_id;
+
+		numFreeBuffers++;
+
+		/* Avoid an infinite loop in case there are cycles in the free list. */
+		if (numFreeBuffers > NBuffers)
+			break;
+
+		nextFree = buf->freeNext;
+	}
+
+	Assert(numFreeBuffers <= NBuffers);
+
+	/*
+	 * Make sure that the StrategyControl's knowledge of last free buffer
+	 * agrees with what's there in the free list.
+	 */
+	if (StrategyControl->firstFreeBuffer != FREENEXT_END_OF_LIST)
+		Assert(StrategyControl->lastFreeBuffer == lastFreeBuffer);
+
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+}
+#endif /* USE_ASSERT_CHECKING */
+
 /* ----------------------------------------------------------------
  *				Backend-private buffer ring management
  * ----------------------------------------------------------------
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index add15e3723b..46949e9d90e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -454,6 +454,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
 extern void StrategyPurgeFreeList(int numBuffers);
+extern void StrategyReInitialize(int FirstBufferToInit);
 extern bool have_free_buffer(void);
 
 /* buf_table.c */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 0c554f0b130..83a75eab844 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -298,7 +298,8 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+extern bool BgBufferSync(struct WritebackContext *wb_context, bool reset);
+extern void BgBufferSyncAdjust(int NBuffersOld, int NBuffersNew);
 
 extern uint32 GetPinLimit(void);
 extern uint32 GetLocalPinLimit(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index bb7ae4d33b3..7d1c64a9267 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -65,6 +65,7 @@ typedef void (*shmem_startup_hook_type) (void);
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
 extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
+extern PGDLLIMPORT volatile bool delay_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
-- 
2.49.0

v5-0010-Additional-validation-for-buffer-in-the-ring.patch (text/plain; charset=us-ascii)
From 25816411c7a77a87b8527a901633865f1c1056b7 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Wed, 11 Jun 2025 18:15:06 +0530
Subject: [PATCH v5 10/10] Additional validation for buffer in the ring

If the buffer pool has been shrunk, the buffers in the buffer ring may
not be valid anymore. Modify GetBufferFromRing to check whether the buffer
is still valid before using it. This makes GetBufferFromRing() a bit more
expensive because of the additional boolean condition, though probably not
expensive enough to affect query performance. The alternative is more
complex, as explained below.

The strategy object is created in CurrentMemoryContext and is not
reachable from any global structure, and is thus not accessible when
processing buffer resizing barriers. We could modify GetAccessStrategy()
to register the strategy in a global linked list and then arrange to
deregister it once it's no longer in use. Looking at the places which use
GetAccessStrategy(), fixing all of those may be some work.

Ashutosh Bapat
---
 src/backend/storage/buffer/freelist.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 41641bb3ae6..74d070733a4 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -948,12 +948,13 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 		strategy->current = 0;
 
 	/*
-	 * If the slot hasn't been filled yet, tell the caller to allocate a new
-	 * buffer with the normal allocation strategy.  He will then fill this
-	 * slot by calling AddBufferToRing with the new buffer.
+	 * If the slot hasn't been filled yet or the buffer in the slot has been
+	 * invalidated when buffer pool was shrunk, tell the caller to allocate a new
+	 * buffer with the normal allocation strategy.  He will then fill this slot
+	 * by calling AddBufferToRing with the new buffer.
 	 */
 	bufnum = strategy->buffers[strategy->current];
-	if (bufnum == InvalidBuffer)
+	if (bufnum == InvalidBuffer || bufnum > NBuffers)
 		return NULL;
 
 	/*
-- 
2.49.0

#76Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#75)
Re: Changing shared_buffers without restart

On Fri, Jun 20, 2025 at 12:19:31PM +0200, Dmitry Dolgov wrote:
Thanks! I've reworked the series to implement the approach suggested by
Thomas, and applied your patches to support buffer shrinking on top. I
had to restructure the patch set; here is how it looks right now:

The base-commit was left in the cover letter which I didn't post, so for
posterity:

base-commit: 4464fddf7b50abe3dbb462f76fd925e10eedad1c

#77Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#75)
1 attachment(s)
Re: Changing shared_buffers without restart

Hi Dmitry,
Thanks for sharing the patches.

On Fri, Jun 20, 2025 at 3:49 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

3. Shared memory shrinking

So far only shared memory increase was implemented. These patches from Ashutosh
support shrinking as well, which is tricky due to the need for buffer eviction.

* Support shrinking shared buffers
* Reinitialize StrategyControl after resizing buffers

This applies to both shrinking and expanding shared buffers. When
expanding, we need to add the new buffers to the freelist by changing the
next pointer of the last buffer in the free list to point to the first
new buffer.

* Additional validation for buffer in the ring

0009 adds support to shrink shared buffers. It has two changes: a.
evict the buffers outside the new buffer size, and b. remove buffers
whose buffer id is outside the new buffer size from the free list. If a
buffer being evicted is pinned, the operation is aborted and a FATAL
error is raised. I think we need to change this behaviour to be less
severe, like rolling back the operation or waiting for the pinned buffer
to be unpinned, etc. Better still if we could let users control the
behaviour. But we need better infrastructure to do such things. That's
one TODO left in the patch.

I haven't reviewed those, just tested a bit to finally include into the series.
Note that I had to tweak two things:

* The way it was originally implemented was to send the resize signal to the
postmaster before doing eviction, which could result in SIGBUS when accessing
the LSN of a dirty buffer to be evicted. I've reshaped it a bit to make sure
eviction always happens first.

Will take a look at this.

* It seems CurrentResourceOwner could be missing sometimes, so I've added
a band-aid checking its presence.

One side note, during my testing I've noticed assert failures on
pgstat_tracks_io_op inside the WAL writer a few times. I couldn't reproduce it
after the fixes above, but it may still indicate that something is off. E.g.
it's somehow not expected that the WAL writer will do buffer eviction IO (from
what I understand, the current shrinking implementation allows that).

Yes. I think we have to find a better way to choose a backend which
does the actual work. Eviction can be done in that backend itself.

The compiler gives a warning about an uninitialized variable, which seems
to be a real bug. Fix attached.

--
Best Wishes,
Ashutosh Bapat

Attachments:

unitialized_variable_warning.patch.txt (text/plain; charset=US-ASCII)
commit 5c54d36d28edcdd72edd6374132ac88e1693fcaa
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date:   Wed Jul 2 15:47:45 2025 +0530

    fixup! Allow to use multiple shared memory mappings
    
    shm_total_page_count is used uninitialized. If this variable has a random
    value to start with, the final sum would be wrong.
    
    Ashutosh Bapat

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 9dd22920a79..24aec528c3c 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -658,7 +658,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 	Size		os_page_size;
 	void	  **page_ptrs;
 	int		   *pages_status;
-	uint64		shm_total_page_count,
+	uint64		shm_total_page_count = 0,
 				shm_ent_page_count,
 				max_nodes;
 	Size	   *nodes;
#78Tomas Vondra
tomas@vondra.me
In reply to: Ashutosh Bapat (#77)
1 attachment(s)
Re: Changing shared_buffers without restart

Hi Ashutosh and Dmitry,

I took a look at this patch, because it's somewhat related to the NUMA
patch series I posted a couple days ago, and I've been wondering if
it makes some of the NUMA stuff harder or simpler.

I don't think it makes a big difference (for the NUMA stuff). My main
question was when we would adjust the "NUMA location" of parts of memory
to keep stuff balanced, but this patch series already needs to update
some of these structs (like the freelists), so those places would be
updated to be NUMA-aware. Some of the changes could be made lazily,
to minimize the amount of time when activity is stopped (like shifting
the buffers to different NUMA nodes). It'd be harder if we wanted to
resize e.g. PGPROC, but that's not the case. So I think this is fine.

I agree it'd be useful to be able to resize shared buffers, without
having to restart the instance (which is obviously very disruptive). So
if we can make this work reliably, with reasonable trade-offs (both for
the backends, and in the risks/complexity introduced by the feature),
that would be great.

I'm far from an expert on mmap() and similar low-level stuff, but the
current approach (reserving a big chunk of shared memory and slicing
it by mmap() into smaller segments) seems reasonable.

But I'm getting a bit lost in how exactly this interacts with things
like overcommit, system memory accounting / OOM killer and this sort of
stuff. I went through the thread and it seems to me the reserve+map
approach works OK in this regard (and the messages on linux-mm seem to
confirm this). But this information is scattered over many messages and
it's hard to say for sure, because some of this might be relevant for
an earlier approach, or a subtly different variant of it.

A similar question is portability. The comments and commit messages
seem to suggest most of this is Linux-specific, and other platforms just
don't have these capabilities. But there's a bunch of messages (mostly
by Thomas Munro) that hint FreeBSD might be capable of this too, even if
to some limited extent. And possibly even Windows/EXEC_BACKEND, although
that seems much trickier.

FWIW I think it's perfectly fine to only support resizing on selected
platforms, especially considering Linux is the most widely used system
for running Postgres. We still need to be able to build/run on other
systems, of course. And maybe it'd be good to be able to disable this
even on Linux, if that eliminates some overhead and/or risks for people
who don't need the feature. Just a thought.

Anyway, my main point is that this information is important, but very
scattered over the thread. It's a bit foolish to expect everyone who
wants to do a review to read the whole thread (which will inevitably
grow longer over time), and assemble all these pieces again and again,
following all the changes in the design etc. Few people will get over
that hurdle, IMHO.

So I think it'd be very helpful to write a README, explaining the
current design/approach, and summarizing all these aspects in a single
place. Including things like portability, interaction with the OS
accounting, OOM killer, this kind of stuff. Some of this stuff may
already be mentioned in code comments, but it's hard to find those.

Especially worth documenting are the states the processes need to go
through (using the barriers), and the transitions between them (i.e.
what is allowed in each phase, what blocks can be visible, etc.).
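
From what I could piece together from the patches, the happy path looks
roughly like this (my reading, so take it with a grain of salt):

  new shared_buffers value     backends notice pending_pm_shmem_resize
  SHMEM_RESIZE_START           all backends attach to the barrier and wait
  eviction (shrink only)       one backend purges the freelist and evicts
                               buffers beyond the new NBuffers
                               (SHMEM_RESIZE_EVICT)
  PMSIGNAL_SHMEM_RESIZE        postmaster and backends run
                               AnonymousShmemResize()
  SHMEM_RESIZE_DONE            everyone waits again, bgwriter resets its
                               stats, barrier is detached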

I'll go over some higher-level items first, and then over some comments
for individual patches.

1) no user docs

There are no user .sgml docs, and maybe it's time to write some,
explaining how to use this thing - how to configure it, how to trigger
the resizing, etc. It took me a while to realize I need to do ALTER
SYSTEM + pg_reload_conf() to kick this off.
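
For the archives, the sequence that kicks it off (the value is just an
example) is:

  =# ALTER SYSTEM SET shared_buffers = '512MB';
  =# SELECT pg_reload_conf();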

It should also document the user-visible limitations, e.g. what activity
is blocked during the resizing, etc.

2) pending GUC changes

I'm somewhat skeptical about the GUC approach. I don't think it was
designed with this kind of use case in mind, and so I think it's quite
likely it won't be able to handle it well.

For example, there's almost no validation of the values, so how do you
ensure the new value makes sense? Because if it doesn't, it can easily
crash the system (I've seen such crashes repeatedly, I'll get to that).
Sure, you may do ALTER SYSTEM to set shared_buffers to nonsense and it
won't start after restart/reboot, but crashing an instance is maybe a
little bit more annoying.

Let's say we did the ALTER SYSTEM + pg_reload_conf(), and it gets stuck
waiting on something (can't evict a buffer or something). How do you
cancel it, when the change is already written to the .auto.conf file?
Can you simply do ALTER SYSTEM + pg_reload_conf() again?

It also seems a bit strange that the "switch" gets to be driven by a
randomly selected backend (unless I'm misunderstanding this bit). It
seems to be true for the buffer eviction during shrinking, at least.

Perhaps this should be a separate utility command, or maybe even just
a new ALTER SYSTEM variant? Or even just a function, similar to what
the "online checksums" patch did, possibly combined with a bgworder
(but probably not needed, there are no db-specific tasks to do).

3) max_available_memory

Speaking of GUCs, I dislike how max_available_memory works. It seems a
bit backwards to me. I mean, we're specifying shared_buffers (and some
other parameters), and the system calculates the amount of shared memory
needed. But here the GUC determines the total limit directly?

I think the GUC should specify the maximum shared_buffers we want to
allow, and then we'd work out the total to pre-allocate? Considering
we're only allowing to resize shared_buffers, that should be pretty
trivial. Yes, it might happen that the "total limit" happens to exceed
the available memory or something, but we already have the problem
with shared_buffers. Seems fine if we explain this in the docs, and
perhaps print the calculated memory limit on start.
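
Roughly something like this is what I mean (a hypothetical helper with a
made-up max_shared_buffers GUC and an incomplete, rough list of the
per-buffer areas, just to show the direction):

#define BLCKSZ 8192
#define BUFFER_DESC_SIZE 64		/* roughly sizeof(BufferDescPadded) */

/*
 * Work out how much address space to reserve for the NBuffers-dependent
 * segments, given the maximum number of buffers we want to allow.
 */
static size_t
reservation_for_max_buffers(size_t max_nbuffers)
{
	size_t		total = 0;

	total += max_nbuffers * BLCKSZ;				/* buffer pool itself */
	total += max_nbuffers * BUFFER_DESC_SIZE;	/* buffer descriptors */
	total += max_nbuffers * 64;					/* other per-buffer state
												 * (rough guess) */
	return total;
}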

In any case, we should not allow setting a value that ends up
overflowing the internal reserved space. It's true we don't have a good
way to do checks for GUCs, but it's a bit silly to crash because of
hitting some non-obvious internal limit that we do in fact know about.

Maybe this is a reason why GUC hooks are not a good way to set this.

4) SHMEM_RESIZE_RATIO

The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way
these ratios can make sense. For example, BLCKSZ is 8192 but the buffer
descriptor is 64B. That's 128x difference, but the ratios says 0.6 and
0.1, so 6x. Sure, we'll actually allocate only the memory we need, and
the rest is only "reserved".

However, that just makes the max_available_memory a bit misleading,
because you can't ever use all of it. You can use the 60% for shared buffers
(which is not mentioned anywhere, and good luck not overflowing that,
as it's never checked), but those smaller regions are guaranteed to be
mostly unused. Unfortunate.

And it's not just a matter of fixing those ratios, because then someone
rebuilds with 32kB blocks and you're in the same situation.

Moreover, all of the above is for mappings sized based on NBuffers. But
if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
moment someone increases max_connections, max_locks_per_transaction
and possibly some other stuff?

5) no tests

I mentioned no "user docs", but the patch has 0 tests too. Which seems
a bit strange for a patch of this age.

A really serious part of the patch series seems to be the coordination
of processes when going through the phases, enforced by the barriers.
This seems like a perfect match for testing using injection points, and
I know we did something like this in the online checksums patch, which
needs to coordinate processes in a similar way.

But even just a simple TAP test that does a bunch of (random?) resizes
while running a pgbench seems better than no tests. (That's what I did
manually, and it crashed right away.)

There's a lot more stuff to test here, I think. Idle sessions with
buffers pinned by open cursors, multiple backends doing ALTER SYSTEM
+ pg_reload_conf concurrently, other kinds of failures.

6) SIGBUS failures

As mentioned, I did some simple tests with shrink/resize with a pgbench
in the background, and it almost immediately crashed for me :-( With a
SIGBUS, which I think is fairly rare on x86 (definitely much less common
than e.g. SIGSEGV).

An example backtrace attached.

7) EXEC_BACKEND, FreeBSD

We clearly need to keep this working on systems without the necessary
bits (so likely EXEC_BACKEND, FreeBSD etc.). But the builds currently
fail in both cases, it seems.

I think it's fine to not support resizing on every platform, then we'd
never get it, but it still needs to build. It would be good to not have
two very different code versions, one for resizing and one without it,
though. I wonder if we can just have the "no-resize" build use the same struct
(with the segments/mapping, ...) and all that, but skipping the space
reservation.

8) monitoring

So, let's say I start a resize of shared buffers. How will I know what
it's currently doing, how much longer it might take, what it's waiting
for, etc.? I think it'd be good to have progress monitoring, through
the regular system view (e.g. pg_stat_shmem_resize_progress?).

10) what to do about stuck resize?

AFAICS the resize can get stuck for various reasons, e.g. because it
can't evict pinned buffers, possibly indefinitely. Not great, it's not
clear to me if there's a way out (canceling the resize) after a timeout,
or something like that? Not great to start an "online resize" only to
get stuck with all activity blocked for indefinite amount of time, and
get to restart anyway.

Seems related to Thomas' message [2], but AFAICS the patch does not do
anything about this yet, right? What's the plan here?

11) preparatory actions?

Even if it doesn't get stuck, some of the actions can take a while, like
evicting dirty buffers before shrinking, etc. This is similar to what
happens on restart, when the shutdown checkpoint can take a while, while
the system is (partly) unavailable.

The common mitigation is to do an explicit checkpoint right before the
restart, to make the shutdown checkpoint cheap. Could we do something
similar for the shrinking, e.g. flush buffers from the part to be
removed before actually starting the resize?

12) does this affect e.g. fork() costs?

I wonder if this affects the cost of fork() in some undesirable way?
Could it make fork() measurably more expensive?

13) resize "state" is all over the place

For me, a big hurdle when reasoning about the resizing correctness is
that there's quite a lot of distinct pieces tracking what the current
"state" is. I mean, there's:

- ShmemCtrl->NSharedBuffers
- NBuffers
- NBuffersOld
- NBuffersPending
- ... (I'm sure I missed something)

There's no cohesive description how this fits together, it seems a bit
"ad hoc". Could be correct, but I find it hard to reason about.

14) interesting messages from the thread

While reading through the thread, I noticed a couple messages that I
think are still relevant:

- I see Peter E posted some review in 2024/11 [3], but it seems his
comments were mostly ignored. I agree with most of them.

- Robert mentioned a couple interesting failure scenarios in [4], not
sure if all of this was handled. He however assumes pointers would
not be stable (and that's something we should not allow, and the
current approach works OK in this regard, I think). He also outlines
how it'd happen in phases - this would be useful for the design README
I think. It also reminds me the "phases" in the checksums patch.

- Robert asked [5] if Linux might abruptly break this, but I find that
unlikely. We'd point out we rely on this, and they'd likely rethink.
This would be made safer if this was specified by POSIX - taking that
away once implemented seems way harder than for custom extensions.
It's likely they'd not take away the feature without an alternative
way to achieve the same effect, I think (yes, harder to maintain).
Tom suggests [7] this is not in POSIX.

- Matthias mentioned [6] similar flags on other operating systems. Could
some of those be used to implement the same resizing?

- Andres had an interesting comment about how overcommit interacts with
MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit
accounting. There are also some comments from the linux-mm people [9].

- There seem to be some issues with releasing memory backing a mapping
with hugetlb [10]. With the fd (and truncating the file), this seems
to release the memory, but it's linux-specific? But most of this stuff
is specific to linux, it seems. So is this a problem? With this it
should be working even for hugetlb ...

- It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and
make the hugetlb stuff work just like on Linux? Unclear. Also, I
thought the mfd stuff is linux-specific ... or am I confused?

- Andres objected to any approach without pointer stability, and I agree
with that. If we can figure out such solution, of course.

- Thomas asked [13] why we need to stop all the backends, instead of
just waiting for them to acknowledge the new (smaller) NBuffers value
and then let them continue. I also don't quite see why this should
not work, and it'd limit the disruption when we have to wait for
eviction of buffers pinned by paused cursors, etc.

Now, some comments about the individual patches (some of this may be a
bit redundant with the earlier points):

v5-0001-Process-config-reload-in-AIO-workers.patch

1) Hmmm, so which other workers may need such explicit handling? Do all
other processes participate in procsignal stuff, or does anything
need an explicit handling?

v5-0002-Introduce-pending-flag-for-GUC-assign-hooks.patch

No additional comments, see the points about resizing through a GUC
callback with pending flag vs. a separate utility command, monitoring
and so on.

v5-0003-Introduce-pss_barrierReceivedGeneration.patch

1) Do we actually need this? Isn't it enough to just have two barriers?
Or a barrier + condition variable, or something like that.

2) The comment talks about "coordinated way" when processing messages,
but it's not very clear to me. It should explain what is needed and
not possible with the current barrier code.

3) This very much reminds me what the online checksums patch needed to
do, and we managed to do it using plain barriers. So why does this
need this new thing? (No opinion on whether it's correct.)

v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch

1) "int shmem_segment" - wouldn't it be better to have a separate enum
for this? I mean, we'll have a predefined list of segments, right?

2) typedef struct AnonymousMapping would deserve some comment

3) ANON_MAPPINGS - Probably should be MAX_ANON_MAPPINGS? But we'll know
how many we have, so why not allocate exactly the right number?
Or even just an array of structs, like in similar cases?

4) static int next_free_segment = 0;

We exactly know what segments we'll create and in which order, no? So
why do we even bother with this next_free_segment thing? Can't we
simply declare an array of AnonymousMapping elements, with all the
elements, and then just walk it and calculate the sizes/pointers?

5) I'm a bit confused about the segment/mapping difference. The patch
seems to randomly mix those, or maybe I'm just confused. I mean,
we are creating just shmem segment, and the pieces are mappings,
right? So why do we index them by "shmem_segment"?

Also, consider

CreateAnonymousSegment(AnonymousMapping *mapping)

so is that creating a segment or mapping? Or what's the difference?

Or are we creating multiple segments, and I missed that? Or are there
different "segment" concepts, or what?

6) There should probably be some sort of API wrapping the mappings, so
that the various places don't need to mess with next_free_segments
directly, etc. Perhaps PGSharedMemoryCreate() shouldn't do this, and
should just pass size to CreateAnonymousSegment(), and that finding
empty slot in Mappings, etc.? Not sure that'll work, but it's a bit
error-prone if a struct is modified from multiple places like this.

7) We should remember which segments got to use huge pages and which
did not. And we should make it optional for each segment. Although,
maybe I'm just confused about the "segment" definition - if we only
have one, that's where huge pages are applied.

If we could have multiple segments for different segments (whatever
that means), not sure what we'll report for cases when some segments
get to use huge pages and others don't. Either because we don't want
to use that for some segments, or because we happen to run out of
the available huge pages.

8) It seems PGSharedMemoryDetach got some significant changes, but the
comment was not modified at all. I'd guess that means the comment is
perhaps stale, or maybe there's something we should mention.

9) I doubt the Assert on GetConfigOption needs to be repeated for all
segments (in CreateSharedMemoryAndSemaphores).

10) Why do we have the Mapping and Segments indexed in different ways?
I mean, Mappings seem to be filled in FIFO (just grab the next free
slot), while Segments are indexed by segment ID.

11) Actually, what's the difference between the contents of Mappings
and Segments? Isn't that the same thing, indexed in the same way?
Or could it be unified? Or are they conceptually different thing?

12) I believe we'll have a predefined list of segments, with fixed IDs,
so why not just have a MAX of those IDs as the capacity?

13) Would it be good to have some checks on shmem_segment values? That
it's valid with respect to defined segments, etc. An assert, maybe?
What about some asserts on the Mapping/Segment elements? To check
that the element is sensible, and that the arrays "match" (if we
need both).

14) Some of the lines got pretty long, e.g. in pg_get_shmem_allocations.
I suggest we define some macros to make this shorter, or something
like that.

15) I'd maybe rename ShmemSegment to PGShmemSegment, for consistency
with PGShmemHeader?

16) Is MAIN_SHMEM_SEGMENT something we want to expose in a public header
file? Seems very much like an internal thing, people should access
it only through APIs ...

v5-0005-Address-space-reservation-for-shared-memory.patch

1) Shouldn't reserved_offset and huge_pages_on really be in the segment
info? Or maybe even in mapping info? (again, maybe I'm confused
about what these structs store)

2) CreateSharedMemoryAndSemaphores comment is rather light on what it
does, considering it now reserves space and then carves it into
segments.

3) So ReserveAnonymousMemory is what makes decisions about huge pages,
for the whole reserved space / all segments in it. That's a bit
unfortunate with respect to the desirability of some segments
benefiting from huge pages and others not. Maybe we should have two
"reserved" areas, one with huge pages, one without?

I guess we don't want too many segments, because that might make
fork() more expensive, etc. Just guessing, though. Also, how would
this work with threading?

4) Any particular reason to define max_available_memory as
GUC_UNIT_BLOCKS and not GUC_UNIT_MB? Of course, if we change this
to have "max shared buffers limit" then it'd make sense to use
blocks, but "total limit" is not in blocks.

5) The general approach seems sound to me, but I'm not expert on this.
I wonder how portable this behavior is. I mean, will it work on other
Unix systems / Windows? Is it POSIX or Linux extension?

6) It might be a good idea to have Assert procedures to check mappings
and segments (that it doesn't overflow reserved space, etc.). It
took me ages to realize I can change shared_buffers to >60% of the
limit, it'll happily oblige and then just crash with OOM when
calling mprotect().
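
Something with the shape of this is what I have in mind (field names are
invented, since I don't remember the exact struct layout in the patch):

/*
 * Hypothetical sanity check: the mapped part of a segment must always
 * fit into the address range reserved for it.
 */
typedef struct MappingSketch
{
	char	   *reserved_base;	/* start of the reserved range */
	size_t		reserved_size;	/* size of the reservation */
	char	   *mapped_base;	/* start of the currently mapped part */
	size_t		mapped_size;	/* size of the currently mapped part */
} MappingSketch;

static void
AssertMappingFitsReservation(const MappingSketch *m)
{
	Assert(m->mapped_base >= m->reserved_base);
	Assert(m->mapped_size <= m->reserved_size);
	Assert(m->mapped_base + m->mapped_size <=
		   m->reserved_base + m->reserved_size);
}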

v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch

1) I suspect the SHMEM_RESIZE_RATIO is the wrong direction, because it
entirely ignores relationships between the parts. See the earlier
comment about this.

2) In fact, what happens if the user tries to resize to a value that is
too large for one of the segments? How would the system know before
starting the resize (and failing)?

3) It seems wrong to modify the BufferManagerShmemSize like this. It's
probably better to have a "...SegmentSize" function for individual
segments, and let BufferManagerShmemSize() still return a sum of
all segments.

4) I think MaxAvailableMemory is the wrong abstraction, because that's
not what people specify. See earlier comment.

5) Let's say we change the shared memory size (ALTER SYSTEM), trigger
the config reload (pg_reload_conf). But then we find that we can't
actually shrink the buffers, for some unpredictable reason (e.g.
there's pinned buffers). How do we "undo" the change? We can't
really undo the ALTER SYSTEM, that's already written in the .conf
and we don't know the old value, IIRC. Is it reasonable to start
killing backends from the assign_hook or something? Seems weird.

v5-0007-Allow-to-resize-shared-memory-without-restart.patch

1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug
somewhere in the resizing?

2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird?
Why is it needed, exactly? If it's to flush stuff for processes
consuming EmitProcSignalBarrier, isn't that too late? What if a
process consumes the barrier between the emit and memory barrier?

3) WaitOnShmemBarrier seem a bit under-documented.

4) Is this actually adding buffers to the freelist? I see buf_init only
links the new buffers by setting freeNext, but where are the new
buffers added to the existing freelist?
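
What I'd expect to see somewhere (sketched against the existing freelist
layout in freelist.c, as if it lived there; not code from the patch) is
roughly this:

/*
 * After growing from old_nbuffers to new_nbuffers, the new descriptors
 * are already chained together via freeNext, but they still need to be
 * spliced onto the end of the existing strategy freelist.
 */
static void
append_new_buffers_to_freelist(int old_nbuffers, int new_nbuffers)
{
	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

	if (StrategyControl->firstFreeBuffer < 0)
		StrategyControl->firstFreeBuffer = old_nbuffers;
	else
		GetBufferDescriptor(StrategyControl->lastFreeBuffer)->freeNext =
			old_nbuffers;

	StrategyControl->lastFreeBuffer = new_nbuffers - 1;

	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}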

5) The issue with a new backend seeing an old NBuffers value reminds me
of the "support enabling checksums online" thread, where we ran into
similar race conditions. See message [1], the part about race #2
(the other race might be relevant too, not sure). It's been a while,
but I think our conclusion in that thread was that the "best" fix
would be to change the order of steps in InitPostgres(), i.e. setup
the ProcSignal stuff first, and only then "copy" the NBuffers value.
And handle the possibility that we receive "duplicate" barriers.

6) In fact, the online checksums thread seems like a possible source of
inspiration for some of the issues, because it needs to do similar
stuff (e.g. make sure all backends follow steps in a synchronized
way, etc.). And it didn't need new types of Barrier to do that.

7) Also, this seems like a perfect match for testing using injection
points. In fact, there's not a single test in the whole patch series.
Or a single line of .sgml docs, for that matter. It took me a while
to realize I'm supposed to change the size by ALTER SYSTEM + reload
the config.

v5-0008-Support-shrinking-shared-buffers.patch

1) Why is ShmemCtrl->evictor_pid reset in AnonymousShmemResize? Isn't
there a place starting it and waiting for it to complete? Why
shouldn't it do EvictExtraBuffers itself?

2) Isn't the change to BufferManagerShmemInit wrong? How do we know the
last buffer is still at the end of the freelist? Seems unlikely.

3) Seems a bit strange to do it from a random backend. Shouldn't it
be the responsibility of a process like checkpointer/bgwriter, or
maybe a dedicated dynamic bgworker? Can we even rely on a backend
to be available?

4) Unsolved issues with buffers pinned for a long time. Could be an
issue if the buffer is pinned indefinitely (e.g. cursor in idle
connection), and the resizing blocks some activity (new connections
or stuff like that).

5) Funny that "AI suggests" something, but doesn't the block fail to
reset nextVictimBuffer of the clocksweep? It may point to a buffer
we're removing, and it'll be invalid, no?
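
I.e. I'd expect something like this to happen when shrinking (just a
sketch using the existing StrategyControl field; the completePasses
bookkeeping would need attention too):

/*
 * When shrinking the pool, make sure the clock-sweep hand cannot point
 * into the removed range anymore.
 */
static void
reset_clocksweep_after_shrink(void)
{
	pg_atomic_write_u32(&StrategyControl->nextVictimBuffer, 0);
}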

6) It's not clear to me in what situations this triggers (in the call
to BufferManagerShmemInit)

if (FirstBufferToInit < NBuffers) ...

v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch

1) IMHO this should be included in the earlier resize/shrink patches,
I don't see a reason to keep it separate (assuming this is the
correct way, and the "init" is not).

2) Doesn't StrategyPurgeFreeList already do some of this for the case
of shrinking memory?

3) Not great adding a bunch of static variables to bufmgr.c. Why do we
need to make "everything" static global? Isn't it enough to make
only the "valid" flag global? The rest can stay local, no?

If everything needs to be global for some reason, could we at least
make it a struct, to group the fields, not just separate random
variables? And maybe at the top, not halfway through the file?

4) Isn't the name BgBufferSyncAdjust misleading? It's not adjusting
anything, it's just invalidating the info about past runs.

5) I don't quite understand why BufferSync needs to do the dance with
delay_shmem_resize. I mean, we certainly should not run BufferSync
from the code that resizes buffers, right? Certainly not after the
eviction, from the part that actually rebuilds shmem structs etc.
So perhaps something could trigger resize while we're running the
BufferSync()? Isn't that a bit strange? If this flag is needed, it
seems more like a band-aid for some issue in the architecture.

6) Also, why should it be fine to get into a situation where some of the
buffers might not be valid during shrinking? I mean, why do we need
this check (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)?
It seems better to ensure we never get into "sync" in a way that
might leave some of the buffers invalid. Seems way too low-level to
care about whether a resize is happening.

7) I don't understand the new condition for "Execute the LRU scan".
Won't this stop LRU scan even in cases when we want it to happen?
Don't we want to scan the buffers in the remaining part (after
shrinking), for example? Also, we already checked this shmem flag at
the beginning of the function - sure, it could change (if some other
process modifies it), but does that make sense? Wouldn't it cause
problems if it can change at an arbitrary point while running the
BufferSync? IMHO just another sign it may not make sense to allow
this, i.e. buffer sync should not run during the "actual" resize.

v5-0010-Additional-validation-for-buffer-in-the-ring.patch

1) So the problem is we might create a ring before shrinking shared
buffers, and then GetBufferFromRing will see bogus buffers? OK, but
we should be more careful with these checks, otherwise we'll miss
real issues when we incorrectly get an invalid buffer. Can't the
backends do this only when they for sure know we did shrink the
shared buffers? Or maybe even handle that during the barrier?

2) IMHO this is a sign the "transitions" between different NBuffers
values may not be defined clearly enough, and we're allowing stuff to happen
in the "blurry" area. I think that's likely to cause bugs (it did
cause issues for the online checksums patch, I think).

[1]: /messages/by-id/3372a09c-d1f6-4974-ad60-eec15ee0c734@vondra.me
[2]: /messages/by-id/CA+hUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g@mail.gmail.com
[3]: /messages/by-id/12add41a-7625-4639-a394-a5563e349322@eisentraut.org
[4]: /messages/by-id/CA+TgmoZFfn0E+EkUAjnv_QM_00eUJPkgCJKzm3n1G4itJKMSsA@mail.gmail.com
[5]: /messages/by-id/cnthxg2eekacrejyeonuhiaezc7vd7o2uowlsbenxqfkjwgvwj@qgzu6eoqrglb
[6]: /messages/by-id/CAEze2WiMkmXUWg10y+_oGhJzXirZbYHB5bw0=VWte+YHwSBa=A@mail.gmail.com
[7]: /messages/by-id/397218.1732844567@sss.pgh.pa.us
[8]: /messages/by-id/gzhuqq3eszx7w46j5de5jehycygipsy7zmfrtdkhfbj5utl6zh@sxyejudixdfe
[9]: https://lore.kernel.org/linux-mm/pr7zggtdgjqjwyrfqzusih2suofszxvlfxdptbo2smneixkp7i@nrmtbhemy3is/
[10]: /messages/by-id/3qzw5fhhb3eqwl3huqabyxechbz7frxs2vk3hx3tb3h7euyvul@pc2rmhehuglc
[11]: /messages/by-id/CA+hUKGJ-RfwSe3=ZS2HRV9rvgrZTJJButfE8Kh5C6Ta2Eb+mPQ@mail.gmail.com
[12]: /messages/by-id/94B56B9C-025A-463F-BC57-DF5B15B8E808@anarazel.de
[13]: /messages/by-id/CA+hUKGLQhsZ1dEf5Zo6JuPbs6n-qX=cTGy49feKf1iFA_TBP1g@mail.gmail.com

--
Tomas Vondra

Attachments:

resize-crash.txt (text/plain)
#79Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Tomas Vondra (#78)
Re: Changing shared_buffers without restart

On Fri, Jul 04, 2025 at 02:06:16AM +0200, Tomas Vondra wrote:
I took a look at this patch, because it's somewhat related to the NUMA
patch series I posted a couple days ago, and I've been wondering if
it makes some of the NUMA stuff harder or simpler.

Thanks a lot for the review! It's plenty of feedback, and I'll
probably take time to answer all of it, but I still want to address
a couple of the most important topics quickly.

But I'm getting a bit lost in how exactly this interacts with things
like overcommit, system memory accounting / OOM killer and this sort of
stuff. I went through the thread and it seems to me the reserve+map
approach works OK in this regard (and the messages on linux-mm seem to
confirm this). But this information is scattered over many messages and
it's hard to say for sure, because some of this might be relevant for
an earlier approach, or a subtly different variant of it.

A similar question is portability. The comments and commit messages
seem to suggest most of this is linux-specific, and other platforms just
don't have these capabilities. But there's a bunch of messages (mostly
by Thomas Munro) that hint FreeBSD might be capable of this too, even if
to some limited extent. And possibly even Windows/EXEC_BACKEND, although
that seems much trickier.

[...]

So I think it'd be very helpful to write a README, explaining the
current design/approach, and summarizing all these aspects in a single
place. Including things like portability, interaction with the OS
accounting, OOM killer, this kind of stuff. Some of this stuff may be
already mentioned in code comments, but it's hard to find those.

Especially worth documenting are the states the processes need to go
through (using the barriers), and the transitions between them (i.e.
what is allowed in each phase, what blocks can be visible, etc.).

Agreed, I'll add a comprehensive README in the next version. Note that
on the topic of portability, the latest version implements a new
approach suggested by Thomas Munro, which reduces the problematic parts
to memfd_create only; memfd_create is documented as Linux-specific, but
AFAICT has FreeBSD counterparts.

1) no user docs

There are no user .sgml docs, and maybe it's time to write some,
explaining how to use this thing - how to configure it, how to trigger
the resizing, etc. It took me a while to realize I need to do ALTER
SYSTEM + pg_reload_conf() to kick this off.

It should also document the user-visible limitations, e.g. what activity
is blocked during the resizing, etc.

While the user interface is still under discussion, I agree, it makes
sense to capture this information in sgml docs.

2) pending GUC changes

[...]

It also seems a bit strange that the "switch" gets to be driven by a
randomly selected backend (unless I'm misunderstanding this bit). It
seems to be true for the buffer eviction during shrinking, at least.

The resize itself is coordinated by the postmaster alone, not by a
randomly selected backend. But looks like buffer eviction indeed can
happen anywhere, which is what we were discussing in the previous
messages.

Perhaps this should be a separate utility command, or maybe even just
a new ALTER SYSTEM variant? Or even just a function, similar to what
the "online checksums" patch did, possibly combined with a bgworder
(but probably not needed, there are no db-specific tasks to do).

This is one topic we still actively discuss, but haven't had much
feedback otherwise. The pros and cons seem to be clear:

* Utilizing the existing GUC mechanism would allow treating
shared_buffers as any other configuration, meaning that potential
users of this feature don't have to do anything new to use it -- they
still can use whatever method they prefer to apply new configuration
(pg_reload_conf, pg_ctl reload, maybe even sending SIGHUP directly).

I'm also wondering if it's only shared_buffers, or some other options
could use similar approach.

* Having a separate utility command is a mighty simplification, which
helps avoid the problems you've described above.

So far we've got two against one in favour of a simple utility command,
so we might as well go with that.

3) max_available_memory

Speaking of GUCs, I dislike how max_available_memory works. It seems a
bit backwards to me. I mean, we're specifying shared_buffers (and some
other parameters), and the system calculates the amount of shared memory
needed. But here the GUC determines the total limit directly?

The reason it's so backwards is that it's coming from the need to
specify how much memory we would like to reserve, and what would be the
upper boundary for increasing shared_buffers. My intention is eventually
to get rid of this GUC and figure its value at runtime as a function of
the total available memory.

I think the GUC should specify the maximum shared_buffers we want to
allow, and then we'd work out the total to pre-allocate? Considering
we're only allowing to resize shared_buffers, that should be pretty
trivial. Yes, it might happen that the "total limit" happens to exceed
the available memory or something, but we already have the problem
with shared_buffers. Seems fine if we explain this in the docs, and
perhaps print the calculated memory limit on start.

Somehow I'm not following what you suggest here. You mean having the
maximum shared_buffers specified, but not as a separate GUC?

4) SHMEM_RESIZE_RATIO

The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way
these ratios can make sense. For example, BLCKSZ is 8192 but the buffer
descriptor is 64B. That's 128x difference, but the ratios says 0.6 and
0.1, so 6x. Sure, we'll actually allocate only the memory we need, and
the rest is only "reserved".

SHMEM_RESIZE_RATIO is a temporary hack, waiting for a more decent
solution, nothing more. I probably have to mention that in the
comments.

Moreover, all of the above is for mappings sized based on NBuffers. But
if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
moment someone increases max_connections, max_locks_per_transaction
and possibly some other stuff?

Can you elaborate on what you mean by that? Increasing max_connections,
etc. leads to increased memory consumption in the MAIN_SHMEM_SEGMENT,
but the ratio is for memory reservation only.

5) no tests

I mentioned no "user docs", but the patch has 0 tests too. Which seems
a bit strange for a patch of this age.

A really serious part of the patch series seems to be the coordination
of processes when going through the phases, enforced by the barriers.
This seems like a perfect match for testing using injection points, and
I know we did something like this in the online checksums patch, which
needs to coordinate processes in a similar way.

Exactly what we've been talking about recently, figuring out how to use
injection points for testing. Keep in mind that the scope of this work
turned out to be huge, and with just two people on board we're
addressing one thing at a time.

But even just a simple TAP test that does a bunch of (random?) resizes
while running a pgbench seem better than no tests. (That's what I did
manually, and it crashed right away.)

This is the type of testing I was doing before posting the series. I
assume you've crashed it while shrinking buffers, since you've got SIGBUS,
which would indicate that the memory is not available anymore. Before we
go into debugging, just to be on the safe side I would like to make sure
you were testing the latest patch version (there are some signs that
it's not the case, about that later)?

10) what to do about stuck resize?

AFAICS the resize can get stuck for various reasons, e.g. because it
can't evict pinned buffers, possibly indefinitely. Not great, it's not
clear to me if there's a way out (canceling the resize) after a timeout,
or something like that? Not great to start an "online resize" only to
get stuck with all activity blocked for indefinite amount of time, and
get to restart anyway.

Seems related to Thomas' message [2], but AFAICS the patch does not do
anything about this yet, right? What's the plan here?

It's another open discussion right now, with an idea to eventually allow
canceling after a timeout. I think canceling when stuck on buffer
eviction should be pretty straightforward (the eviction must take place
before actual shared memory resize, so we know nothing has changed yet),
but in some other failure scenarios it would be harder (e.g. if one
backend is stuck resizing, while others have succeeded -- this would
require another round of synchronization and some way to figure out what
is the current status).

11) preparatory actions?

Even if it doesn't get stuck, some of the actions can take a while, like
evicting dirty buffers before shrinking, etc. This is similar to what
happens on restart, when the shutdown checkpoint can take a while, while
the system is (partly) unavailable.

The common mitigation is to do an explicit checkpoint right before the
restart, to make the shutdown checkpoint cheap. Could we do something
similar for the shrinking, e.g. flush buffers from the part to be
removed before actually starting the resize?

Yeah, that's a good idea, we will try to explore it.

12) does this affect e.g. fork() costs?

I wonder if this affects the cost of fork() in some undesirable way?
Could it make fork() measurably more expensive?

The number of new mappings is quite limited, so I would not expect that.
But I can measure the impact.
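
(Something as trivial as this should be enough to see it -- plain C,
nothing Postgres-specific, with the number of extra mappings passed on
the command line:)

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Time fork()+wait() round trips with a configurable number of extra
 * anonymous mappings in the parent process. */
int
main(int argc, char **argv)
{
	const int	iterations = 10000;
	int			nmaps = (argc > 1) ? atoi(argv[1]) : 0;
	struct timespec start, end;

	for (int i = 0; i < nmaps; i++)
		mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < iterations; i++)
	{
		pid_t		pid = fork();

		if (pid == 0)
			_exit(0);
		waitpid(pid, NULL, 0);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("%d extra mappings: %d forks in %.3f s\n", nmaps, iterations,
		   (end.tv_sec - start.tv_sec) +
		   (end.tv_nsec - start.tv_nsec) / 1e9);
	return 0;
}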

14) interesting messages from the thread

While reading through the thread, I noticed a couple messages that I
think are still relevant:

Right, I'm aware there is a lot of not yet addressed feedback, even more
than you've mentioned below. None of this feedback was ignored, we're
just solving large problems step by step. So far the focus was on how to
do memory reservation and to coordinate resize, and everybody is more
than welcome to join. But thanks for collecting the list, I probably
need to start tracking what was addressed and what was not.

- Robert asked [5] if Linux might abruptly break this, but I find that
unlikely. We'd point out we rely on this, and they'd likely rethink.
This would be made safer if this was specified by POSIX - taking that
away once implemented seems way harder than for custom extensions.
It's likely they'd not take away the feature without an alternative
way to achieve the same effect, I think (yes, harder to maintain).
Tom suggests [7] this is not in POSIX.

This conversation was related to the original implementation, which was
based on mremap and slicing of mappings. As I've mentioned, the new
approach doesn't have most of those controversial points, it uses
memfd_create and regular compatible mmap -- I don't see any of those
changing their behavior any time soon.
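
To make that concrete, the mechanism boils down to something like this
(a simplified sketch of the idea, not the patch code; error handling and
huge pages left out):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/*
 * The segment is backed by an anonymous file (memfd), the maximum
 * possible size is mapped once with MAP_NORESERVE, and resizing is
 * just ftruncate() on the fd.  Pages beyond the current file size are
 * not usable (touching them raises SIGBUS), but the pointers stay
 * stable across resizes.
 */
static void *
create_resizable_segment(size_t max_size, size_t initial_size, int *fd_out)
{
	int			fd = memfd_create("pg_shmem_segment", 0);
	void	   *ptr;

	ftruncate(fd, (off_t) initial_size);

	ptr = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_NORESERVE, fd, 0);

	*fd_out = fd;
	return ptr;
}

static void
resize_segment(int fd, size_t new_size)
{
	/* grow or shrink the backing file; the mapping itself is untouched */
	ftruncate(fd, (off_t) new_size);
}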

- Andres had an interesting comment about how overcommit interacts with
MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit
accounting. There are also some comments from the linux-mm people [9].

The new implementation uses MAP_NORESERVE for the mapping.

- There seem to be some issues with releasing memory backing a mapping
with hugetlb [10]. With the fd (and truncating the file), this seems
to release the memory, but it's linux-specific? But most of this stuff
is specific to linux, it seems. So is this a problem? With this it
should be working even for hugetlb ...

Again, the new implementation got rid of problematic bits here, and I
haven't found any weak points related to hugetlb in testing so far.

- It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and
make the hugetlb stuff work just like on Linux? Unclear. Also, I
thought the mfd stuff is linux-specific ... or am I confused?

Yep, probably.

- Thomas asked [13] why we need to stop all the backends, instead of
just waiting for them to acknowledge the new (smaller) NBuffers value
and then let them continue. I also don't quite see why this should
not work, and it'd limit the disruption when we have to wait for
eviction of buffers pinned by paused cursors, etc.

I think I've replied to that one, the idea so far was to eliminate any
chance of accessing to-be-truncated buffers and make it easier to reason
about correctness of the implementation this way. I don't see any other
way to prevent backends from accessing buffers that may disappear
without adding overhead on the read path, but if you folks have some
ideas -- please share!

v5-0001-Process-config-reload-in-AIO-workers.patch

1) Hmmm, so which other workers may need such explicit handling? Do all
other processes participate in procsignal stuff, or does anything
need an explicit handling?

So far I've noticed the issue only with io_workers and the checkpointer.

v5-0003-Introduce-pss_barrierReceivedGeneration.patch

1) Do we actually need this? Isn't it enough to just have two barriers?
Or a barrier + condition variable, or something like that.

The issue with two barriers is that they do not prevent disjoint groups,
i.e. one backend joins the barrier, finishes the work and detaches from
the barrier, then another backend joins. I'm not familiar with how this
was solved for the online checksums patch though, will take a look. Having a
barrier and a condition variable would be possible, but it's hard to
figure out how many backends to wait for. All in all, a small extension
to the ProcSignalBarrier feels to me much more elegant.
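
For illustration, the coordinator side then becomes a two-phase wait
roughly like this (EmitProcSignalBarrier/WaitForProcSignalBarrier are
the existing procsignal.h functions; the "received" wait and the barrier
type name are the extension, so treat those names as approximate):

static void
coordinate_resize(void)
{
	uint64		gen;

	/* phase 1: make sure every live backend has received the message,
	 * so the set of participants is fixed before anyone starts */
	gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
	WaitForProcSignalBarrierReceived(gen);

	/* phase 2: wait until everyone has actually processed it, i.e.
	 * remapped the segment and adopted the new NBuffers */
	WaitForProcSignalBarrier(gen);
}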

2) The comment talks about "coordinated way" when processing messages,
but it's not very clear to me. It should explain what is needed and
not possible with the current barrier code.

Yeah, I need to work on the comments across the patch. Here in
particular it means any coordinated way, whatever that could be. I can
add an example to clarify that part.

v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch

Most of the comments here and in the following patches are obviously
reasonable and I'll incorporate them into the next version.

5) I'm a bit confused about the segment/mapping difference. The patch
seems to randomly mix those, or maybe I'm just confused. I mean,
we are creating just shmem segment, and the pieces are mappings,
right? So why do we index them by "shmem_segment"?

Indeed, the patch uses "segment" and "mapping" interchangeably, I need
to tighten it up. The relation is still one to one, thus there are multiple
segments as well as mappings.

7) We should remember which segments got to use huge pages and which
did not. And we should make it optional for each segment. Although,
maybe I'm just confused about the "segment" definition - if we only
have one, that's where huge pages are applied.

If we could have multiple segments for different segments (whatever
that means), not sure what we'll report for cases when some segments
get to use huge pages and others don't.

Exactly to avoid solving this, I've consciously decided to postpone
implementing the possibility to mix huge and regular pages for now. Any
opinions on whether the single reported value should be removed and this
information instead represented as part of an informational view about
shared memory (the one you were suggesting in this thread)?

11) Actually, what's the difference between the contents of Mappings
and Segments? Isn't that the same thing, indexed in the same way?
Or could it be unified? Or are they conceptually different thing?

Unless I'm mixing something up badly, the content is the same. The
relation is that a segment, as a structure, "contains" a mapping.

v5-0005-Address-space-reservation-for-shared-memory.patch

1) Shouldn't reserved_offset and huge_pages_on really be in the segment
info? Or maybe even in mapping info? (again, maybe I'm confused
about what these structs store)

I don't think there is a reserved_offset variable in the latest version
anymore, so can you please confirm you were using the latest version,
rather than the one I posted in April?

3) So ReserveAnonymousMemory is what makes decisions about huge pages,
for the whole reserved space / all segments in it. That's a bit
unfortunate with respect to the desirability of some segments
benefiting from huge pages and others not. Maybe we should have two
"reserved" areas, one with huge pages, one without?

Again, there is no ReserveAnonymousMemory anymore, the new approach is
to reserve the memory via separate mappings.

I guess we don't want too many segments, because that might make
fork() more expensive, etc. Just guessing, though. Also, how would
this work with threading?

I assume multithreading will render it unnecessary to use shared memory
favoring some other types of memory usage, but the mechanism around it
could still be the same.

5) The general approach seems sound to me, but I'm not expert on this.
I wonder how portable this behavior is. I mean, will it work on other
Unix systems / Windows? Is it POSIX or Linux extension?

Don't know yet, it's a topic for investigation.

v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch

2) In fact, what happens if the user tries to resize to a value that is
too large for one of the segments? How would the system know before
starting the resize (and failing)?

This type of situation is handled (doing hard stop) in the latest
version, because all the necessary information is present in the mapping
structure.

v5-0007-Allow-to-resize-shared-memory-without-restart.patch

1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug
somewhere in the resizing?

When coordination with barriers kicks in, there is a cut off line after
which any newly spawned backend will not be able to take part in it
(e.g. it was too slow to init ProcSignal infrastructure).
AdjustShmemSize is used to handle these cases.

2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird?
Why is it needed, exactly? If it's to flush stuff for processes
consuming EmitProcSignalBarrier, isn't that too late? What if a
process consumes the barrier between the emit and memory barrier?

I think it's not needed, a leftover after code modifications.

v5-0008-Support-shrinking-shared-buffers.patch
v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
v5-0010-Additional-validation-for-buffer-in-the-ring.patch

This reminds me I still need to review those, so Ashutosh probably can
answer those questions better than I.

#80Tomas Vondra
tomas@vondra.me
In reply to: Dmitry Dolgov (#79)
Re: Changing shared_buffers without restart

On 7/4/25 16:41, Dmitry Dolgov wrote:

On Fri, Jul 04, 2025 at 02:06:16AM +0200, Tomas Vondra wrote:
I took a look at this patch, because it's somewhat related to the NUMA
patch series I posted a couple days ago, and I've been wondering if
it makes some of the NUMA stuff harder or simpler.

Thanks a lot for the review! It's plenty of feedback, and I'll
probably take time to answer all of it, but I still want to address
a couple of the most important topics quickly.

But I'm getting a bit lost in how exactly this interacts with things
like overcommit, system memory accounting / OOM killer and this sort of
stuff. I went through the thread and it seems to me the reserve+map
approach works OK in this regard (and the messages on linux-mm seem to
confirm this). But this information is scattered over many messages and
it's hard to say for sure, because some of this might be relevant for
an earlier approach, or a subtly different variant of it.

A similar question is portability. The comments and commit messages
seem to suggest most of this is linux-specific, and other platforms just
don't have these capabilities. But there's a bunch of messages (mostly
by Thomas Munro) that hint FreeBSD might be capable of this too, even if
to some limited extent. And possibly even Windows/EXEC_BACKEND, although
that seems much trickier.

[...]

So I think it'd be very helpful to write a README, explaining the
current design/approach, and summarizing all these aspects in a single
place. Including things like portability, interaction with the OS
accounting, OOM killer, this kind of stuff. Some of this stuff may be
already mentioned in code comments, but it's hard to find those.

Especially worth documenting are the states the processes need to go
through (using the barriers), and the transitions between them (i.e.
what is allowed in each phase, what blocks can be visible, etc.).

Agreed, I'll add a comprehensive README in the next version. Note that
on the topic of portability, the latest version implements a new
approach suggested by Thomas Munro, which reduces the problematic parts
to memfd_create only; memfd_create is documented as Linux-specific, but
AFAICT has FreeBSD counterparts.

OK. It's not entirely clear to me if this README should be temporary, or
if it should eventually get committed. I'd probably vote to have a
proper README explaining the basic design / resizing processes etc. It
probably should not discuss portability in too much detail, that can get
stale pretty quick.

1) no user docs

There are no user .sgml docs, and maybe it's time to write some,
explaining how to use this thing - how to configure it, how to trigger
the resizing, etc. It took me a while to realize I need to do ALTER
SYSTEM + pg_reload_conf() to kick this off.

It should also document the user-visible limitations, e.g. what activity
is blocked during the resizing, etc.

While the user interface is still under discussion, I agree, it makes
sense to capture this information in sgml docs.

Yeah. Spelling out the "official" way to use something is helpful.

2) pending GUC changes

[...]

It also seems a bit strange that the "switch" gets to be driven by a
randomly selected backend (unless I'm misunderstanding this bit). It
seems to be true for the buffer eviction during shrinking, at least.

The resize itself is coordinated by the postmaster alone, not by a
randomly selected backend. But looks like buffer eviction indeed can
happen anywhere, which is what we were discussing in the previous
messages.

Perhaps this should be a separate utility command, or maybe even just
a new ALTER SYSTEM variant? Or even just a function, similar to what
the "online checksums" patch did, possibly combined with a bgworder
(but probably not needed, there are no db-specific tasks to do).

This is one topic we still actively discuss, but haven't had much
feedback otherwise. The pros and cons seem to be clear:

* Utilizing the existing GUC mechanism would allow treating
shared_buffers as any other configuration, meaning that potential
users of this feature don't have to do anything new to use it -- they
still can use whatever method they prefer to apply new configuration
(pg_reload_conf, pg_ctl reload, maybe even sending SIGHUP directly).

I'm also wondering if it's only shared_buffers, or some other options
could use similar approach.

I don't know. What are the "potential users" of this feature? I don't
recall any, but there may be some. How do we know the new pending flag
will work for them too?

* Having a separate utility command is a mighty simplification, which
helps avoid the problems you've described above.

So far we've got two against one in favour of a simple utility command,
so we might as well go with that.

Not sure voting is a good way to make design decisions ...

3) max_available_memory

Speaking of GUCs, I dislike how max_available_memory works. It seems a
bit backwards to me. I mean, we're specifying shared_buffers (and some
other parameters), and the system calculates the amount of shared memory
needed. But here the GUC determines the total limit directly?

The reason it's so backwards is that it's coming from the need to
specify how much memory we would like to reserve, and what would be the
upper boundary for increasing shared_buffers. My intention is eventually
to get rid of this GUC and figure its value at runtime as a function of
the total available memory.

I understand why it's like this. It's simple, and people do want to
limit the memory the instance will allocate. That's understandable. The
trouble is it makes it very unclear what's the implied limit on shared
buffers size. Maybe if there was a sensible way to expose that, we could
keep the max_available_memory.

But I don't think you can get rid of the GUC, at least not entirely. You
need to leave some memory aside for queries, people may start multiple
instances at once, ...

I think the GUC should specify the maximum shared_buffers we want to
allow, and then we'd work out the total to pre-allocate? Considering
we're only allowing to resize shared_buffers, that should be pretty
trivial. Yes, it might happen that the "total limit" happens to exceed
the available memory or something, but we already have the problem
with shared_buffers. Seems fine if we explain this in the docs, and
perhaps print the calculated memory limit on start.

Somehow I'm not following what you suggest here. You mean having the
maximum shared_buffers specified, but not as a separate GUC?

My suggestion was to have a GUC max_shared_buffers. Based on that you
can easily calculate the size of all other segments dependent on
NBuffers, and reserve memory for that.

4) SHMEM_RESIZE_RATIO

The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way
these ratios can make sense. For example, BLCKSZ is 8192 but the buffer
descriptor is 64B. That's 128x difference, but the ratios says 0.6 and
0.1, so 6x. Sure, we'll actually allocate only the memory we need, and
the rest is only "reserved".

SHMEM_RESIZE_RATIO is a temporary hack, waiting for a more decent
solution, nothing more. I probably have to mention that in the
comments.

OK

Moreover, all of the above is for mappings sized based on NBuffers. But
if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
moment someone increases max_connections, max_locks_per_transaction
and possibly some other stuff?

Can you elaborate on what you mean by that? Increasing max_connections,
etc. leads to increased memory consumption in the MAIN_SHMEM_SEGMENT,
but the ratio is for memory reservation only.

Stuff like PGPROC, fast-path locks etc. are allocated as part of
MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum
space for that. If I significantly increase GUCs like max_connections or
max_locks_per_transaction, how do you know it didn't exceed the 10%?

5) no tests

I mentioned no "user docs", but the patch has 0 tests too. Which seems
a bit strange for a patch of this age.

A really serious part of the patch series seems to be the coordination
of processes when going through the phases, enforced by the barriers.
This seems like a perfect match for testing using injection points, and
I know we did something like this in the online checksums patch, which
needs to coordinate processes in a similar way.

Exactly what we've been talking about recently, figuring out how to use
injection points for testing. Keep in mind that the scope of this work
turned out to be huge, and with just two people on board we're
addressing one thing at a time.

Sure.

But even just a simple TAP test that does a bunch of (random?) resizes
while running a pgbench seems better than no tests. (That's what I did
manually, and it crashed right away.)

This is the type of testing I was doing before posting the series. I
assume you've crashed it while shrinking buffers, since you've got SIGBUS,
which would indicate that the memory is not available anymore. Before we
go into debugging, just to be on the safe side I would like to make sure
you were testing the latest patch version (there are some signs that
it's not the case, about that later)?

Maybe, I don't remember. But I also see crashes while expanding the
buffers, with assert failure here:

#4 0x0000556f159c43d1 in ExceptionalCondition
(conditionName=0x556f15c00e00 "node->prev != INVALID_PROC_NUMBER ||
list->head == procno", fileName=0x556f15c00ce0
"../../../../src/include/storage/proclist.h", lineNumber=163) at assert.c:66
#5 0x0000556f157a9831 in proclist_contains_offset (list=0x7f296333ce24,
procno=140, node_offset=100) at
../../../../src/include/storage/proclist.h:163
#6 0x0000556f157a9add in ConditionVariableTimedSleep
(cv=0x7f296333ce20, timeout=-1, wait_event_info=134217782) at
condition_variable.c:184
#7 0x0000556f157a99c9 in ConditionVariableSleep (cv=0x7f296333ce20,
wait_event_info=134217782) at condition_variable.c:98
#8 0x0000556f157902df in BarrierArriveAndWait (barrier=0x7f296333ce08,
wait_event_info=134217782) at barrier.c:191
#9 0x0000556f156d1226 in ProcessBarrierShmemResize
(barrier=0x7f296333ce08) at pg_shmem.c:1201

10) what to do about stuck resize?

AFAICS the resize can get stuck for various reasons, e.g. because it
can't evict pinned buffers, possibly indefinitely. Not great, it's not
clear to me if there's a way out (canceling the resize) after a timeout,
or something like that? Not great to start an "online resize" only to
get stuck with all activity blocked for indefinite amount of time, and
get to restart anyway.

Seems related to Thomas' message [2], but AFAICS the patch does not do
anything about this yet, right? What's the plan here?

It's another open discussion right now, with an idea to eventually allow
canceling after a timeout. I think canceling when stuck on buffer
eviction should be pretty straightforward (the eviction must take place
before actual shared memory resize, so we know nothing has changed yet),
but in some other failure scenarios it would be harder (e.g. if one
backend is stuck resizing, while others have succeeded -- this would
require another round of synchronization and some way to figure out what
is the current status).

I think it'll be crucial to structure it so that it can't get stuck
while resizing.

11) preparatory actions?

Even if it doesn't get stuck, some of the actions can take a while, like
evicting dirty buffers before shrinking, etc. This is similar to what
happens on restart, when the shutdown checkpoint can take a while, while
the system is (partly) unavailable.

The common mitigation is to do an explicit checkpoint right before the
restart, to make the shutdown checkpoint cheap. Could we do something
similar for the shrinking, e.g. flush buffers from the part to be
removed before actually starting the resize?

Yeah, that's a good idea, we will try to explore it.

12) does this affect e.g. fork() costs?

I wonder if this affects the cost of fork() in some undesirable way?
Could it make fork() measurably more expensive?

The number of new mappings is quite limited, so I would not expect that.
But I can measure the impact.

14) interesting messages from the thread

While reading through the thread, I noticed a couple messages that I
think are still relevant:

Right, I'm aware there is a lot of not yet addressed feedback, even more
than you've mentioned below. None of this feedback was ignored, we're
just solving large problems step by step. So far the focus was on how to
do memory reservation and to coordinate resize, and everybody is more
than welcome to join. But thanks for collecting the list, I probably
need to start tracking what was addressed and what was not.

- Robert asked [5] if Linux might abruptly break this, but I find that
unlikely. We'd point out we rely on this, and they'd likely rethink.
This would be made safer if this was specified by POSIX - taking that
away once implemented seems way harder than for custom extensions.
It's likely they'd not take away the feature without an alternative
way to achieve the same effect, I think (yes, harder to maintain).
Tom suggests [7] this is not in POSIX.

This conversation was related to the original implementation, which was
based on mremap and slicing of mappings. As I've mentioned, the new
approach doesn't have most of those controversial points; it uses
memfd_create and a regular, compatible mmap -- I don't see any of those
changing their behavior any time soon.

- Andres had an interesting comment about how overcommit interacts with
MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit
accounting. There are also some comments from linux-mm people [9].

The new implementation uses MAP_NORESERVE for the mapping.

- There seem to be some issues with releasing memory backing a mapping
with hugetlb [10]. With the fd (and truncating the file), this seems
to release the memory, but it's linux-specific? But most of this stuff
is specific to linux, it seems. So is this a problem? With this it
should be working even for hugetlb ...

Again, the new implementation got rid of problematic bits here, and I
haven't found any weak points related to hugetlb in testing so far.

- It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and
make the hugetlb stuff work just like on Linux? Unclear. Also, I
thought the mfd stuff is linux-specific ... or am I confused?

Yep, probably.

- Thomas asked [13] why we need to stop all the backends, instead of
just waiting for them to acknowledge the new (smaller) NBuffers value
and then let them continue. I also don't quite see why this should
not work, and it'd limit the disruption when we have to wait for
eviction of buffers pinned by paused cursors, etc.

I think I've replied to that one; the idea so far was to eliminate any
chance of accessing to-be-truncated buffers and make it easier to reason
about the correctness of the implementation this way. I don't see any
other way to prevent backends from accessing buffers that may disappear
without adding overhead on the read path, but if you folks have some
ideas -- please share!

v5-0001-Process-config-reload-in-AIO-workers.patch

1) Hmmm, so which other workers may need such explicit handling? Do all
other processes participate in procsignal stuff, or does anything
need explicit handling?

So far I've noticed the issue only with io_workers and the checkpointer.

v5-0003-Introduce-pss_barrierReceivedGeneration.patch

1) Do we actually need this? Isn't it enough to just have two barriers?
Or a barrier + condition variable, or something like that.

The issue with two barriers is that they do not prevent disjoint groups,
i.e. one backend joins the barrier, finishes the work and detaches from
the barrier, then another backend joins. I'm not familiar with how this
was solved for the online checksums patch though, will take a look.
Having a barrier and a condition variable would be possible, but it's
hard to figure out how many backends to wait for. All in all, a small
extension to the ProcSignalBarrier feels much more elegant to me.
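
To illustrate the shape of it, a minimal sketch (the barrier type and the
"Received" wait function are assumed names for this example, not
necessarily what the patch ends up with):

    /*
     * Sketch only: first wait until every backend has *received* the
     * barrier, so the participating group is fixed and nobody can join
     * late, then wait until everyone has finished processing it.
     */
    uint64  gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);

    WaitForProcSignalBarrierReceived(gen);  /* assumed new function */
    WaitForProcSignalBarrier(gen);          /* existing: all processed */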

2) The comment talks about "coordinated way" when processing messages,
but it's not very clear to me. It should explain what is needed and
not possible with the current barrier code.

Yeah, I need to work on the comments across the patch. Here in
particular it means any coordinated way, whatever that might be. I can
add an example to clarify that part.

v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch

Most of the comments here and in the following patches are obviously
reasonable, and I'll incorporate them into the next version.

5) I'm a bit confused about the segment/mapping difference. The patch
seems to randomly mix those, or maybe I'm just confused. I mean,
we are creating just a shmem segment, and the pieces are mappings,
right? So why do we index them by "shmem_segment"?

Indeed, the patch uses "segment" and "mapping" interchangeably, I need
to tighten that up. The relation is still one to one, thus there are
multiple segments as well as multiple mappings.

7) We should remember which segments got to use huge pages and which
did not. And we should make it optional for each segment. Although,
maybe I'm just confused about the "segment" definition - if we only
have one, that's where huge pages are applied.

If we could have multiple segments for different segments (whatever
that means), not sure what we'll report for cases when some segments
get to use huge pages and others don't.

Exactly to avoid solving this, I've consciously decided to postpone
implementing the possibility to mix huge and regular pages for now. Any
opinions -- should the single reported value be removed, and this
information instead be represented as part of an informational view
about shared memory (the one you were suggesting in this thread)?

11) Actually, what's the difference between the contents of Mappings
and Segments? Isn't that the same thing, indexed in the same way?
Or could it be unified? Or are they conceptually different thing?

Unless I'm mixing something up badly, the content is the same. The
relation is that a segment, as a structure, "contains" a mapping.

Then, why do we need to track it in two places? Doesn't it just increase
the likelihood that someone misses updating one of them?

v5-0005-Address-space-reservation-for-shared-memory.patch

1) Shouldn't reserved_offset and huge_pages_on really be in the segment
info? Or maybe even in mapping info? (again, maybe I'm confused
about what these structs store)

I don't think there is a reserved_offset variable in the latest version
anymore; can you please confirm you're using it, rather than the one
I've posted in April?

3) So ReserveAnonymousMemory is what makes decisions about huge pages,
for the whole reserved space / all segments in it. That's a bit
unfortunate with respect to the desirability of some segments
benefiting from huge pages and others not. Maybe we should have two
"reserved" areas, one with huge pages, one without?

Again, there is no ReserveAnonymousMemory anymore, the new approach is
to reserve the memory via separate mappings.

Will check. These may indeed be stale comments, from looking at the
earlier version of the patch (the last one from Ashutosh).

I guess we don't want too many segments, because that might make
fork() more expensive, etc. Just guessing, though. Also, how would
this work with threading?

I assume multithreading will make using shared memory unnecessary, in
favor of some other type of memory management, but the mechanism around
it could still be the same.

5) The general approach seems sound to me, but I'm not an expert on this.
I wonder how portable this behavior is. I mean, will it work on other
Unix systems / Windows? Is it POSIX or Linux extension?

Don't know yet, it's a topic for investigation.

v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch

2) In fact, what happens if the user tries to resize to a value that is
too large for one of the segments? How would the system know before
starting the resize (and failing)?

This type of situation is handled (with a hard stop) in the latest
version, because all the necessary information is present in the mapping
structure.

I don't know, but crashing the instance (I assume that's what you mean
by hard stop) does not seem like something we want to do. AFAIK the GUC
hook should be able to determine if the value is too large, and reject
it at that point. Not proceed and crash everything.

v5-0007-Allow-to-resize-shared-memory-without-restart.patch

1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug
somewhere in the resizing?

When coordination via barriers kicks in, there is a cut-off point after
which a newly spawned backend will not be able to take part in it (e.g.
if it was too slow to init the ProcSignal infrastructure).
AdjustShmemSize is used to handle such cases.

2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird?
Why is it needed, exactly? If it's to flush stuff for processes
consuming EmitProcSignalBarrier, isn't that too late? What if a
process consumes the barrier between the emit and the memory barrier?

I think it's not needed; it's a leftover from code modifications.

v5-0008-Support-shrinking-shared-buffers.patch
v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
v5-0010-Additional-validation-for-buffer-in-the-ring.patch

This reminds me I still need to review those, so Ashutosh probably can
answer those questions better than I.

--
Tomas Vondra

#81Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Tomas Vondra (#80)
Re: Changing shared_buffers without restart

On Fri, Jul 04, 2025 at 05:23:29PM +0200, Tomas Vondra wrote:

2) pending GUC changes

Perhaps this should be a separate utility command, or maybe even just
a new ALTER SYSTEM variant? Or even just a function, similar to what
the "online checksums" patch did, possibly combined with a bgworder
(but probably not needed, there are no db-specific tasks to do).

This is one topic we still actively discuss, but haven't had much
feedback otherwise. The pros and cons seem to be clear:

* Utilizing the existing GUC mechanism would allow treating
shared_buffers as any other configuration, meaning that potential
users of this feature don't have to do anything new to use it -- they
still can use whatever method they prefer to apply new configuration
(pg_reload_conf, pg_ctl reload, maybe even sending SIGHUP directly).

I'm also wondering if it's only shared_buffers, or whether some other
options could use a similar approach.

I don't know. What are the "potential users" of this feature? I don't
recall any, but there may be some. How do we know the new pending flag
will work for them too?

It could be potentially useful for any GUC that controls a resource
shared between backends and requires a restart. To make such a GUC
changeable online, every backend has to perform some action, and they
have to coordinate to make sure things are consistent -- exactly the use
case we're trying to address; shared_buffers just happens to be one of
such resources. While I agree that the currently implemented interface
is wrong (e.g. it doesn't prevent pending GUCs from being stored in
PG_AUTOCONF_FILENAME, which has to happen only when the new value is
actually applied), it still makes sense to me to allow a more flexible
lifecycle for certain GUCs.

An example I could think of is shared_preload_libraries. If we ever want
to do a hot reload of libraries, it will follow the procedure above:
every backend has to do something like dlclose / dlopen and make sure
that other backends have the same version of the library. Another,
maybe less far-fetched, example is max_worker_processes, which AFAICT is
mostly used to control the number of slots in shared memory (although
it's also stored in the control file, which makes things more
complicated).

* Having a separate utility command is a mighty simplification, which
helps avoid the problems you've described above.

So far we've got two against one in favour of a simple utility command,
so we may as well go with that.

Not sure voting is a good way to make design decisions ...

I'm somewhat torn between those two options myself. The more I think
about this topic, the more I'm convinced that a pending GUC makes sense,
but also the more work I see needed to implement it. Maybe a good middle
ground is to go with a simple utility command, as Ashutosh was
suggesting, and keep the pending GUC infrastructure on top of that as an
optional patch.

3) max_available_memory

I think the GUC should specify the maximum shared_buffers we want to
allow, and then we'd work out the total to pre-allocate? Considering
we're only allowing to resize shared_buffers, that should be pretty
trivial. Yes, it might happen that the "total limit" happens to exceed
the available memory or something, but we already have the problem
with shared_buffers. Seems fine if we explain this in the docs, and
perhaps print the calculated memory limit on start.

Somehow I'm not following what you suggest here. You mean having the
maximum shared_buffers specified, but not as a separate GUC?

My suggestion was to have a guc max_shared_buffers. Based on that you
can easily calculate the size of all other segments dependent on
NBuffers, and reserve memory for that.

Got it, ok.

Moreover, all of the above is for mappings sized based on NBuffers. But
if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
moment someone increases max_connections, max_locks_per_transaction
and possibly some other stuff?

Can you elaborate on what you mean by that? Increasing max_connections,
etc. leads to increased memory consumption in the MAIN_SHMEM_SEGMENT,
but the ratio is for memory reservation only.

Stuff like PGPROC, fast-path locks etc. are allocated as part of
MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum
space for that. If I significantly increase GUCs like max_connections or
max_locks_per_transaction, how do you know it didn't exceed the 10%?

I still don't see the problem. The 10% we're talking about is the
reserved space, thus it affects only the shared memory resizing
operation and nothing else. The real memory allocated is less than or
equal to the reserved size, but it is allocated and managed completely
in the same way as without the patch, including size calculations. If
some GUCs are increased and drive real memory usage up, it will be
handled as before. Are we on the same page about this?

11) Actually, what's the difference between the contents of Mappings
and Segments? Isn't that the same thing, indexed in the same way?
Or could it be unified? Or are they conceptually different thing?

Unless I'm mixing something badly, the content is the same. The relation
is a segment as a structure "contains" a mapping.

Then, why do we need to track it in two places? Doesn't it just increase
the likelihood that someone misses updating one of them?

To clarify, under "contents" I mean the shared memory content (the
actual data) behind both the "segment" and the "mapping"; maybe you had
something else in mind.

On the surface of it, those are two different data structures that have
mostly different, but related, fields: a shared memory segment contains
stuff needed for working with the memory (header, base, end, lock), a
mapping has more lower-level details, e.g. reserved space, fd, IPC key.
The only common fields are size and address, maybe I can factor them out
to avoid the repetition.
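
For illustration, a rough sketch of that distinction (field names and
types here are approximate, not the actual patch definitions):

    typedef struct ShmemSegment
    {
        PGShmemHeader *header;   /* segment header */
        void          *base;     /* start of the usable area */
        void          *end;      /* end of the usable area */
        slock_t       *lock;     /* allocation lock for this segment */
    } ShmemSegment;

    typedef struct ShmemMapping
    {
        void   *address;         /* where the memory is mapped */
        Size    size;            /* currently mapped size */
        Size    reserved;        /* address space reserved for resizing */
        int     fd;              /* memfd backing the mapping */
        key_t   ipc_key;         /* IPC key */
    } ShmemMapping;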

2) In fact, what happens if the user tries to resize to a value that is
too large for one of the segments? How would the system know before
starting the resize (and failing)?

This type of situation is handled (doing hard stop) in the latest
version, because all the necessary information is present in the mapping
structure.

I don't know, but crashing the instance (I assume that's what you mean
by hard stop) does not seem like something we want to do. AFAIK the GUC
hook should be able to determine if the value is too large, and reject
it at that point. Not proceed and crash everything.

I see, you're pointing out that it would be good to have more validation
at the GUC level, right?

#82Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#79)
Re: Changing shared_buffers without restart

On Fri, Jul 04, 2025 at 04:41:51PM +0200, Dmitry Dolgov wrote:

v5-0003-Introduce-pss_barrierReceivedGeneration.patch

1) Do we actually need this? Isn't it enough to just have two barriers?
Or a barrier + condition variable, or something like that.

The issue with two barriers is that they do not prevent disjoint groups,
i.e. one backend joins the barrier, finishes the work and detaches from
the barrier, then another backend joins. I'm not familiar with how this
was solved for the online checksums patch though, will take a look.
Having a barrier and a condition variable would be possible, but it's
hard to figure out how many backends to wait for. All in all, a small
extension to the ProcSignalBarrier feels much more elegant to me.

After quickly checking how the online checksums patch deals with the
coordination, I've realized my answer here about the disjoint groups is
not quite correct. You were asking about ProcSignalBarrier; I was
answering about the barrier within the resizing logic. Here is how it
looks to me:

* We could follow the same approach as the online checksums patch:
launch a coordinator worker (Ashutosh was suggesting that, but no
implementation has materialized yet) and fire two ProcSignalBarriers,
one to kick off resizing and another one to finish it. Maybe it could
even be three phases, with an extra one to tell backends not to pull new
buffers into the pool, to help the buffer eviction process.

* This way any backend between the ProcSignalBarriers will be able to
proceed with whatever it's doing, and there is a need to make sure it
will not access buffers that will soon disappear. A suggestion so far
was to get all backends to agree not to allocate any new buffers in the
to-be-truncated range, but accessing already existing buffers that
will soon go away is a problem as well. As far as I can tell there is
no rock solid method to make sure a backend doesn't have a reference
to such a buffer somewhere (this was discussed earlier in the
thread), meaning that either a backend has to wait or buffers have to
be checked every time on access.

* Since the latter adds a performance overhead, we went with the former
(making backends wait). And here is where all the complexity comes
from, because waiting backends cannot rely on a ProcSignalBarrier and
thus require some other approach. If I've overlooked any other
alternative to backends waiting, let me know.

It also seems a bit strange that the "switch" gets to be driven by
a randomly selected backend (unless I'm misunderstanding this bit). It
seems to be true for the buffer eviction during shrinking, at least.

But it looks like the eviction could indeed be improved via a new
coordinator worker. Before resizing shared memory such a worker would
first tell all the backends not to allocate new buffers via a
ProcSignalBarrier, then do the buffer eviction itself. Since backends
don't need to wait after this type of ProcSignalBarrier, it should work,
with only one worker doing the eviction. But the second
ProcSignalBarrier, for the resize itself, would still follow the current
procedure with everybody waiting.
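
Roughly, the coordinator's flow could look like the sketch below (the
barrier types and EvictBuffersInRange are hypothetical names, only meant
to show the sequence, not the actual patch code):

    uint64  gen;

    /* phase 1: tell backends to stop allocating from the doomed range */
    gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_STOP_ALLOC);
    WaitForProcSignalBarrier(gen);

    /* phase 2: only the coordinator evicts buffers in that range */
    EvictBuffersInRange(NBuffersNew, NBuffersOld);

    /* phase 3: the actual resize, where backends still have to wait */
    gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
    WaitForProcSignalBarrier(gen);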

Does it make sense to you folks?

#83Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#82)
Re: Changing shared_buffers without restart

On Sun, Jul 06, 2025 at 03:01:34PM +0200, Dmitry Dolgov wrote:
* This way any backend between the ProcSignalBarriers will be able to
proceed with whatever it's doing, and there is a need to make sure it
will not access buffers that will soon disappear. A suggestion so far
was to get all backends to agree not to allocate any new buffers in the
to-be-truncated range, but accessing already existing buffers that
will soon go away is a problem as well. As far as I can tell there is
no rock solid method to make sure a backend doesn't have a reference
to such a buffer somewhere (this was discussed earlier in the
thread), meaning that either a backend has to wait or buffers have to
be checked every time on access.

And sure enough, after I wrote this I've realized there should be no
such references after the buffer eviction and the prohibition of new
buffer allocations. I still need to check it though, because not only
buffers, but other shared memory structures (whose number depends on
NBuffers) will be truncated. But if they are also handled by the
eviction, then maybe everything is just fine.

#84Tomas Vondra
tomas@vondra.me
In reply to: Dmitry Dolgov (#81)
Re: Changing shared_buffers without restart

On 7/5/25 12:35, Dmitry Dolgov wrote:

On Fri, Jul 04, 2025 at 05:23:29PM +0200, Tomas Vondra wrote:

2) pending GUC changes

Perhaps this should be a separate utility command, or maybe even just
a new ALTER SYSTEM variant? Or even just a function, similar to what
the "online checksums" patch did, possibly combined with a bgworder
(but probably not needed, there are no db-specific tasks to do).

This is one topic we still actively discuss, but haven't had much
feedback otherwise. The pros and cons seem to be clear:

* Utilizing the existing GUC mechanism would allow treating
shared_buffers as any other configuration, meaning that potential
users of this feature don't have to do anything new to use it -- they
still can use whatever method they prefer to apply new configuration
(pg_reload_conf, pg_ctl reload, maybe even sending SIGHUP directly).

I'm also wondering if it's only shared_buffers, or some other options
could use similar approach.

I don't know. What are the "potential users" of this feature? I don't
recall any, but there may be some. How do we know the new pending flag
will work for them too?

It could be potentially useful for any GUC that controls a resource
shared between backends and requires a restart. To make such a GUC
changeable online, every backend has to perform some action, and they
have to coordinate to make sure things are consistent -- exactly the use
case we're trying to address; shared_buffers just happens to be one of
such resources. While I agree that the currently implemented interface
is wrong (e.g. it doesn't prevent pending GUCs from being stored in
PG_AUTOCONF_FILENAME, which has to happen only when the new value is
actually applied), it still makes sense to me to allow a more flexible
lifecycle for certain GUCs.

An example I could think of is shared_preload_libraries. If we ever want
to do a hot reload of libraries, it will follow the procedure above:
every backend has to do something like dlclose / dlopen and make sure
that other backends have the same version of the library. Another,
maybe less far-fetched, example is max_worker_processes, which AFAICT is
mostly used to control the number of slots in shared memory (although
it's also stored in the control file, which makes things more
complicated).

Not sure. My concern is the config reload / GUC assign hook was not
designed with this use case in mind, and we'll run into issues. I also
dislike the "async" nature of this, which makes it harder to e.g. abort
the change, etc.

* Having a separate utility command is a mighty simplification, which
helps avoiding problems you've described above.

So far we've got two against one in favour of simple utility command, so
we can as well go with that.

Not sure voting is a good way to make design decisions ...

I'm somewhat torn between those two options myself. The more I think
about this topic, the more I'm convinced that a pending GUC makes sense, but
the more work I see needed to implement that. Maybe a good middle ground
is to go with a simple utility command, as Ashutosh was suggesting, and
keep pending GUC infrastructure on top of that as an optional patch.

What about a simple function? Probably not as clean as a proper utility
command, and it implies a transaction - not sure if that could be a
problem for some part of this.

3) max_available_memory

I think the GUC should specify the maximum shared_buffers we want to
allow, and then we'd work out the total to pre-allocate? Considering
we're only allowing to resize shared_buffers, that should be pretty
trivial. Yes, it might happen that the "total limit" happens to exceed
the available memory or something, but we already have the problem
with shared_buffers. Seems fine if we explain this in the docs, and
perhaps print the calculated memory limit on start.

Somehow I'm not following what you suggest here. You mean having the
maximum shared_buffers specified, but not as a separate GUC?

My suggestion was to have a guc max_shared_buffers. Based on that you
can easily calculate the size of all other segments dependent on
NBuffers, and reserve memory for that.

Got it, ok.

Moreover, all of the above is for mappings sized based on NBuffers. But
if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
moment someone increases of max_connection, max_locks_per_transaction
and possibly some other stuff?

Can you elaborate, what do you mean by that? Increasing max_connection,
etc. leading to increased memory consumption in the MAIN_SHMEM_SEGMENT,
but the ratio is for memory reservation only.

Stuff like PGPROC, fast-path locks etc. are allocated as part of
MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum
space for that. If I significantly increase GUCs like max_connections or
max_locks_per_transaction, how do you know it didn't exceed the 10%?

Still don't see the problem. The 10% we're talking about is the reserved
space, thus it affects only shared memory resizing operation and nothing
else. The real memory allocated is less than or equal to the reserved
size, but is allocated and managed completely in the same way as without
the patch, including size calculations. If some GUCs are increased and
drive real memory usage high, it will be handled as before. Are we on
the same page about this?

How do you know reserving 10% is sufficient? Imagine I set

max_available_memory = '256MB'
max_connections = 1000000
max_locks_per_transaction = 10000

How do you know it's not more than 10% of the available memory?

FWIW if I add a simple assert to CreateAnonymousSegment

Assert(mapping->shmem_reserved >= allocsize);

it crashes even with just the max_available_memory=256MB

#4 0x0000000000b74fbd in ExceptionalCondition (conditionName=0xe25920 "mapping->shmem_reserved >= allocsize", fileName=0xe251e7 "pg_shmem.c", lineNumber=878) at assert.c:66

because we happen to execute it with this:

mapping->shmem_reserved 26845184 allocsize 125042688

I think I mentioned a similar crash earlier, not sure if that's the same
issue or a different one.

11) Actually, what's the difference between the contents of Mappings
and Segments? Isn't that the same thing, indexed in the same way?
Or could it be unified? Or are they conceptually different thing?

Unless I'm mixing something badly, the content is the same. The relation
is a segment as a structure "contains" a mapping.

Then, why do we need to track it in two places? Doesn't it just increase
the likelihood that someone misses updating one of them?

To clarify, under "contents" I mean the shared memory content (the
actual data) behind both "segment" and the "mapping", maybe you had
something else in mind.

On the surface of it those are two different data structures that have
mostly different, but related, fields: a shared memory segment contains
stuff needed for working with memory (header, base, end, lock), mapping
has more lower level details, e.g. reserved space, fd, IPC key. The
only common fields are size and address, maybe I can factor them out to
not repeat.

OK, I think I'm just confused by the ambiguous definitions of
segment/mapping. It'd be good to document/explain this in a comment
somewhere.

2) In fact, what happens if the user tries to resize to a value that is
too large for one of the segments? How would the system know before
starting the resize (and failing)?

This type of situation is handled (doing hard stop) in the latest
version, because all the necessary information is present in the mapping
structure.

I don't know, but crashing the instance (I assume that's what you mean
by hard stop) does not seem like something we want to do. AFAIK the GUC
hook should be able to determine if the value is too large, and reject
it at that point. Not proceed and crash everything.

I see, you're pointing out that it would be good to have more validation
at the GUC level, right?

Well, that'd be a starting point. We definitely should not allow setting
a value that ends up crashing the instance (it does not matter whether
it's because of a FATAL or hitting a segfault/sigbus somewhere).

cheers

--
Tomas Vondra

#85Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Tomas Vondra (#84)
Re: Changing shared_buffers without restart

On Mon, Jul 07, 2025 at 01:57:42PM +0200, Tomas Vondra wrote:

It could be potentially useful for any GUC that controls a resource
shared between backends and requires a restart. To make such a GUC
changeable online, every backend has to perform some action, and they
have to coordinate to make sure things are consistent -- exactly the use
case we're trying to address; shared_buffers just happens to be one of
such resources. While I agree that the currently implemented interface
is wrong (e.g. it doesn't prevent pending GUCs from being stored in
PG_AUTOCONF_FILENAME, which has to happen only when the new value is
actually applied), it still makes sense to me to allow a more flexible
lifecycle for certain GUCs.

An example I could think of is shared_preload_libraries. If we ever want
to do a hot reload of libraries, it will follow the procedure above:
every backend has to do something like dlclose / dlopen and make sure
that other backends have the same version of the library. Another,
maybe less far-fetched, example is max_worker_processes, which AFAICT is
mostly used to control the number of slots in shared memory (although
it's also stored in the control file, which makes things more
complicated).

Not sure. My concern is the config reload / GUC assign hook was not
designed with this use case in mind, and we'll run into issues. I also
dislike the "async" nature of this, which makes it harder to e.g. abort
the change, etc.

Yes, the GUC assign hook was not designed for that. That's why the idea
is to extend the design and see if it will be good enough.

I'm somewhat torn between those two options myself. The more I think
about this topic, the more I'm convinced that a pending GUC makes sense, but
the more work I see needed to implement that. Maybe a good middle ground
is to go with a simple utility command, as Ashutosh was suggesting, and
keep pending GUC infrastructure on top of that as an optional patch.

What about a simple function? Probably not as clean as a proper utility
command, and it implies a transaction - not sure if that could be a
problem for some part of this.

I'm currently inclined towards this, plus a new worker to coordinate the
process, with everything else provided as an optional follow-up step. I
will try this out unless there are any objections.

Stuff like PGPROC, fast-path locks etc. are allocated as part of
MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum
space for that. If I significantly increase GUCs like max_connections or
max_locks_per_transaction, how do you know it didn't exceed the 10%?

Still don't see the problem. The 10% we're talking about is the reserved
space, thus it affects only shared memory resizing operation and nothing
else. The real memory allocated is less than or equal to the reserved
size, but is allocated and managed completely in the same way as without
the patch, including size calculations. If some GUCs are increased and
drive real memory usage high, it will be handled as before. Are we on
the same page about this?

How do you know reserving 10% is sufficient? Imagine I set

I see, I was convinced you were talking about changing something at
runtime that would hit the reservation boundary. But you mean all of
that simply at startup, and yes, of course it will fail -- see the point
about SHMEM_RATIO being just a temporary hack.

#86Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#85)
Re: Changing shared_buffers without restart

On Mon, Jul 7, 2025 at 6:36 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Jul 07, 2025 at 01:57:42PM +0200, Tomas Vondra wrote:

It could be potentially useful for any GUC that controls a resource
shared between backends and requires a restart. To make such a GUC
changeable online, every backend has to perform some action, and they
have to coordinate to make sure things are consistent -- exactly the use
case we're trying to address; shared_buffers just happens to be one of
such resources. While I agree that the currently implemented interface
is wrong (e.g. it doesn't prevent pending GUCs from being stored in
PG_AUTOCONF_FILENAME, which has to happen only when the new value is
actually applied), it still makes sense to me to allow a more flexible
lifecycle for certain GUCs.

An example I could think of is shared_preload_libraries. If we ever want
to do a hot reload of libraries, it will follow the procedure above:
every backend has to do something like dlclose / dlopen and make sure
that other backends have the same version of the library. Another,
maybe less far-fetched, example is max_worker_processes, which AFAICT is
mostly used to control the number of slots in shared memory (although
it's also stored in the control file, which makes things more
complicated).

Not sure. My concern is the config reload / GUC assign hook was not
designed with this use case in mind, and we'll run into issues. I also
dislike the "async" nature of this, which makes it harder to e.g. abort
the change, etc.

Yes, the GUC assign hook was not designed for that. That's why the idea
is to extend the design and see if it will be good enough.

I'm somewhat torn between those two options myself. The more I think
about this topic, the more I'm convinced that a pending GUC makes sense, but
the more work I see needed to implement that. Maybe a good middle ground
is to go with a simple utility command, as Ashutosh was suggesting, and
keep pending GUC infrastructure on top of that as an optional patch.

What about a simple function? Probably not as clean as a proper utility
command, and it implies a transaction - not sure if that could be a
problem for some part of this.

I'm currently inclined towards this and a new one worker to coordinate
the process, with everything else provided as an optional follow-up
step. Will try this out unless there are any objections.

I will reply to the questions but let me summarise my offlist
discussion with Andres.

I had proposed an ALTER SYSTEM ... UPDATE ... approach at pgconf.dev for
any system-wide GUC change such as this. However, Andres pointed out
that any UI proposal has to honour the current ability to edit
postgresql.conf and trigger the change in a running server. ALTER
SYSTEM ... UPDATE ... does not allow that. So, I think we have to build
something similar to, or on top of, the current ALTER SYSTEM ... SET
+ pg_reload_conf().

My current proposal is ALTER SYSTEM ... SET + pg_reload_conf() with a
pending mark, plus pg_apply_pending_conf(<name of GUC>, <more
parameters>). The third function would take a GUC name as a parameter
and complete the application of the pending change. If the proposed
change is not valid, it will throw an error. If there are problems
completing the change, it will throw an error and keep the pending mark
intact. Further, the function can take GUC-specific parameters which
control the application process. For example, it could tell whether to
wait for a backend to unpin a buffer, or cancel that query, or kill the
backend, or abort the application itself. If the operation takes too
long, a user may want to cancel the function execution just like
cancelling a query. Running two concurrent instances of the function,
both applying the same GUC, won't be allowed.

Does that look good?

--
Best Wishes,
Ashutosh Bapat

#87Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#86)
Re: Changing shared_buffers without restart

On Mon, Jul 07, 2025 at 07:12:50PM +0530, Ashutosh Bapat wrote:

My current proposal is ALTER SYSTEM ... SET + pg_reload_conf() with
pending mark + pg_apply_pending_conf(<name of GUC>, <more
parameters>). The third function would take a GUC name as parameter
and complete the pending application change. If the proposed change is
not valid, it will throw an error. If there are problems completing
the change it will throw an error and keep the pending mark intact.
Further the function can take GUC specific parameters which control
the application process. E.g. for example it could tell whether to
wait for a backend to unpin a buffer or cancel that query or kill the
backend or abort the application itself. If the operation takes too
long, a user may want to cancel the function execution just like
cancelling a query. Running two concurrent instances of the function,
both applying the same GUC won't be allowed.

Yeah, it could look like this, but it's a large chunk of work, as is
improving the current implementation. I'm still convinced that using the
GUC mechanism one way or another is the right choice here, but maybe
better as the follow-up step I was mentioning above -- simply to limit
the scope and move step by step. How does that sound?

Regarding the proposal, I'm somewhat uncomfortable with the fact that
between those two function calls the system will be in an awkward state
for some time, and how long that takes will not be controlled by the
resizing logic anymore. But otherwise it seems to be equivalent to what
we want to achieve in many other aspects.

#88Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#83)
Re: Changing shared_buffers without restart

On Sun, Jul 06, 2025 at 03:21:08PM +0200, Dmitry Dolgov wrote:

On Sun, Jul 06, 2025 at 03:01:34PM +0200, Dmitry Dolgov wrote:
* This way any backend between the ProcSignalBarriers will be able to
proceed with whatever it's doing, and there is a need to make sure it
will not access buffers that will soon disappear. A suggestion so far
was to get all backends to agree not to allocate any new buffers in the
to-be-truncated range, but accessing already existing buffers that
will soon go away is a problem as well. As far as I can tell there is
no rock solid method to make sure a backend doesn't have a reference
to such a buffer somewhere (this was discussed earlier in the
thread), meaning that either a backend has to wait or buffers have to
be checked every time on access.

And sure enough, after I wrote this I've realized there should be no
such references after the buffer eviction and prohibiting new buffer
allocation. I still need to check it though, because not only buffers,
but other shared memory structures (which number depends on NBuffers)
will be truncated. But if they will also be handled by the eviction,
then maybe everything is just fine.

Pondering more about this topic, I've realized there was one more
problematic case mentioned by Robert early in the thread, which is
relatively easy to construct:

* When increasing shared buffers from NBuffers_small to NBuffers_large
it's possible that one backend already has applied NBuffers_large,
then allocated a buffer B from (NBuffers_small, NBuffers_large] and put
it into the buffer lookup table.

* In the meantime another backend still has NBuffers_small, but got
buffer B from the lookup table.

Currently it's being addressed via every backend waiting for each other,
but I guess it could be as well managed via handling the freelist, so
that only "available" buffers will be inserted into the lookup table.

It's probably the only such case, but I can't tell that for sure (hard
to say, maybe there are more tricky cases with the latest async io). If
you folks have some other examples that may break, let me know. The
idea behind making everyone wait was to be rock solid that no similar
but unknown scenarios could damage the resize procedure.

As for other structures, BufferBlocks, BufferDescriptors and
BufferIOCVArray are all buffer-indexed, so making sure shared memory
resizing works for buffers should automatically mean the same for the
rest. But CkptBufferIds is a different case, as it collects buffers to
sync and processes them at a later point in time -- it has to be
explicitly handled when shrinking shared memory, I guess.

Long story short, in the next version of the patch I'll try to
experiment with a simplified design: a simple function to trigger
resizing, launching a coordinator worker, with backends not waiting for
each other and buffers first allocated and then marked as "available to
use".

#89Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#88)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 12:07 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Sun, Jul 06, 2025 at 03:21:08PM +0200, Dmitry Dolgov wrote:

On Sun, Jul 06, 2025 at 03:01:34PM +0200, Dmitry Dolgov wrote:
* This way any backend between the ProcSignalBarriers will be able to
proceed with whatever it's doing, and there is a need to make sure it
will not access buffers that will soon disappear. A suggestion so far
was to get all backends to agree not to allocate any new buffers in the
to-be-truncated range, but accessing already existing buffers that
will soon go away is a problem as well. As far as I can tell there is
no rock solid method to make sure a backend doesn't have a reference
to such a buffer somewhere (this was discussed earlier in the
thread), meaning that either a backend has to wait or buffers have to
be checked every time on access.

And sure enough, after I wrote this I've realized there should be no
such references after the buffer eviction and prohibiting new buffer
allocation. I still need to check it though, because not only buffers,
but other shared memory structures (which number depends on NBuffers)
will be truncated. But if they will also be handled by the eviction,
then maybe everything is just fine.

Pondering more about this topic, I've realized there was one more
problematic case mentioned by Robert early in the thread, which is
relatively easy to construct:

* When increasing shared buffers from NBuffers_small to NBuffers_large
it's possible that one backend already has applied NBuffers_large,
then allocated a buffer B from (NBuffers_small, NBuffers_large] and put
it into the buffer lookup table.

* In the meantime another backend still has NBuffers_small, but got
buffer B from the lookup table.

Currently it's being addressed via every backend waiting for each other,
but I guess it could be as well managed via handling the freelist, so
that only "available" buffers will be inserted into the lookup table.

I didn't get how that can be managed by the freelist. Buffers are also
allocated through the clock sweep, which needs to be managed as well.

Long story short, in the next version of the patch I'll try to
experiment with a simplified design: a simple function to trigger
resizing, launching a coordinator worker, with backends not waiting for
each other and buffers first allocated and then marked as "available to
use".

Should all the backends wait between buffer allocation and them being
marked as "available"? I assume that marking them as available means
"declaring the new NBuffers". What about when shrinking the buffers?
Do you plan to make all the backends wait while the coordinator is
evicting buffers?

--
Best Wishes,
Ashutosh Bapat

#90Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#89)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 10:25:51AM +0530, Ashutosh Bapat wrote:

Currently it's being addressed via every backend waiting for each other,
but I guess it could be as well managed via handling the freelist, so
that only "available" buffers will be inserted into the lookup table.

I didn't get how can that be managed by freelist? Buffers are also
allocated through clocksweep, which needs to be managed as well.

The way it is implemented in the patch right now, new buffers are added
into the freelist right away, when they're initialized, by virtue of
nextFree. What I have in mind is to do this as the last step, when all
backends have confirmed the shared memory signal was absorbed. This
would mean that StrategyControl will not return a buffer id from the
freshly allocated range until everything is done, and no such buffer
will be inserted into the buffer lookup table.
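
A sketch of that ordering (the helper names and the barrier type are made
up, only to show the idea, not the patch code):

    uint64  gen;

    /* initialize descriptors for the new range, but don't publish them */
    InitNewBufferDescriptors(NBuffersOld, NBuffersNew);    /* hypothetical */

    /* wait until every backend has absorbed the resize signal */
    gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
    WaitForProcSignalBarrier(gen);

    /*
     * Only now link the new range into the freelist, so that
     * StrategyGetBuffer() can start handing out those buffer ids.
     */
    AppendRangeToFreeList(NBuffersOld, NBuffersNew);       /* hypothetical */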

You're right of course, a buffer id could be returned from the
ClockSweep and from the custom strategy buffer ring. But from what I see
those are picking a buffer from the set of already utilized buffers,
meaning that for a buffer to land there it first has to go through
StrategyControl->firstFreeBuffer, and hence the idea above will be a
requirement for those as well.

Long story short, in the next version of the patch I'll try to
experiment with a simplified design: a simple function to trigger
resizing, launching a coordinator worker, with backends not waiting for
each other and buffers first allocated and then marked as "available to
use".

Should all the backends wait between buffer allocation and them being
marked as "available"? I assume that marking them as available means
"declaring the new NBuffers".

Yep, making buffers available would be equivalent to declaring the new
NBuffers. What I think is important to note here is that we use two
mechanisms for coordination: the shared structure ShmemControl, which
shares the state of the operation, and ProcSignal, which tells backends
to do something (change the memory mapping). Declaring the new NBuffers
could be done via ShmemControl, atomically applying the new value,
instead of sending a ProcSignal -- this way there is no need for
backends to wait, but StrategyControl would need to use the ShmemControl
value instead of the local copy of NBuffers. Does that make sense to
you?
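
A minimal sketch of what I mean, assuming a pg_atomic_uint32 in the
shared control structure (ShmemCtl / ShmemCtlShared and the field name
are made up here):

    typedef struct ShmemCtl
    {
        pg_atomic_uint32 nbuffers;  /* currently usable number of buffers */
    } ShmemCtl;

    /* coordinator, once the new range is fully initialized: */
    pg_atomic_write_u32(&ShmemCtlShared->nbuffers, NBuffersNew);

    /* allocation path, instead of the backend-local NBuffers copy: */
    uint32  nbuffers = pg_atomic_read_u32(&ShmemCtlShared->nbuffers);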

What about when shrinking the buffers? Do you plan to make all the
backends wait while the coordinator is evicting buffers?

No, it was never planned like that, since it could easily end up with
the coordinator waiting for a backend to unpin a buffer, and the backend
waiting for a signal from the coordinator.

#91Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#90)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 1:40 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Jul 14, 2025 at 10:25:51AM +0530, Ashutosh Bapat wrote:

Currently it's being addressed via every backend waiting for each other,
but I guess it could be as well managed via handling the freelist, so
that only "available" buffers will be inserted into the lookup table.

I didn't get how can that be managed by freelist? Buffers are also
allocated through clocksweep, which needs to be managed as well.

The way it is implemented in the patch right now, new buffers are added
into the freelist right away, when they're initialized, by virtue of
nextFree. What I have in mind is to do this as the last step, when all
backends have confirmed the shared memory signal was absorbed. This
would mean that StrategyControl will not return a buffer id from the
freshly allocated range until everything is done, and no such buffer
will be inserted into the buffer lookup table.

You're right of course, a buffer id could be returned from the
ClockSweep and from the custom strategy buffer ring. But from what I see
those are picking a buffer from the set of already utilized buffers,
meaning that for a buffer to land there it first has to go through
StrategyControl->firstFreeBuffer, and hence the idea above will be a
requirement for those as well.

That isn't true. A buffer which was never in the free list can still
be picked up by clock sweep. But you are raising a relevant point
about StrategyControl below

Long story short, in the next version of the patch I'll try to
experiment with a simplified design: a simple function to trigger
resizing, launching a coordinator worker, with backends not waiting for
each other and buffers first allocated and then marked as "available to
use".

Should all the backends wait between buffer allocation and them being
marked as "available"? I assume that marking them as available means
"declaring the new NBuffers".

Yep, making buffers available would be equivalent to declaring the new
NBuffers. What I think is important here is to note, that we use two
mechanisms for coordination: the shared structure ShmemControl that
shares the state of operation, and ProcSignal that tells backends to do
something (change the memory mapping). Declaring the new NBuffers could
be done via ShmemControl, atomically applying the new value, instead of
sending a ProcSignal -- this way there is no need for backends to wait,
but StrategyControl would need to use the ShmemControl instead of local
copy of NBuffers. Does it make sense to you?

When expanding buffers, letting StrategyControl continue with the old
NBuffers may work. When propagating the new buffer value we have to
reinitialize StrategyControl to use new NBuffers. But when shrinking,
the StrategyControl needs to be initialized with the new NBuffers,
lest it picks a victim from buffers being shrunk. And then if the
operation fails, we have to reinitialize the StrategyControl again
with the old NBuffers.

What about when shrinking the buffers? Do you plan to make all the
backends wait while the coordinator is evicting buffers?

No, it was never planned like that, since it could easily end up with
coordinator waiting for the backend to unpin a buffer, and the backend
to wait for a signal from the coordinator.

I agree about the deadlock situation. How do we prevent the backends
from picking, or continuing to work with, a buffer from the range being
shrunk then? Each backend then has to do something about its respective
pinned buffers.

--
Best Wishes,
Ashutosh Bapat

#92Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#91)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote:

You're right of course, a buffer id could be returned from the
ClockSweep and from the custom strategy buffer ring. But from what I see
those are picking a buffer from the set of already utilized buffers,
meaning that for a buffer to land there it first has to go through
StrategyControl->firstFreeBuffer, and hence the idea above will be a
requirement for those as well.

That isn't true. A buffer which was never in the free list can still
be picked up by clock sweep.

How's that?

Yep, making buffers available would be equivalent to declaring the new
NBuffers. What I think is important here is to note, that we use two
mechanisms for coordination: the shared structure ShmemControl that
shares the state of operation, and ProcSignal that tells backends to do
something (change the memory mapping). Declaring the new NBuffers could
be done via ShmemControl, atomically applying the new value, instead of
sending a ProcSignal -- this way there is no need for backends to wait,
but StrategyControl would need to use the ShmemControl instead of local
copy of NBuffers. Does it make sense to you?

When expanding buffers, letting StrategyControl continue with the old
NBuffers may work. When propagating the new buffer value we have to
reinitialize StrategyControl to use new NBuffers. But when shrinking,
the StrategyControl needs to be initialized with the new NBuffers,
lest it picks a victim from buffers being shrunk. And then if the
operation fails, we have to reinitialize the StrategyControl again
with the old NBuffers.

Right, those two cases will become more asymmetrical: for expanding,
the number of available buffers would have to be propagated to the
backends at the end, when the new buffers are ready; for shrinking, the
number of available buffers would have to be propagated at the start, so
that backends stop allocating soon-to-be-unavailable buffers.

What about when shrinking the buffers? Do you plan to make all the
backends wait while the coordinator is evicting buffers?

No, it was never planned like that, since it could easily end up with
coordinator waiting for the backend to unpin a buffer, and the backend
to wait for a signal from the coordinator.

I agree with the deadlock situation. How do we prevent the backends
from picking or continuing to work with a buffer from buffers being
shrunk then? Each backend then has to do something about their
respective pinned buffers.

The idea I've got so far is to stop allocating buffers from the
unavailable range and wait until backends have unpinned all unavailable
buffers. We either wait unconditionally until that happens, or bail out
after a certain timeout.
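
For the waiting part, a rough sketch of the "bail out after a timeout"
variant (start, timeout_ms and the NBuffersNew/NBuffersOld range are
assumed variables; this is not the patch code):

    for (int i = NBuffersNew; i < NBuffersOld; i++)
    {
        BufferDesc *desc = GetBufferDescriptor(i);
        uint32      state = pg_atomic_read_u32(&desc->state);

        /* wait until nobody holds a pin on this soon-to-vanish buffer */
        while (BUF_STATE_GET_REFCOUNT(state) > 0)
        {
            if (TimestampDifferenceExceeds(start, GetCurrentTimestamp(),
                                           timeout_ms))
                elog(ERROR, "shrinking timed out on pinned buffer %d", i);

            pg_usleep(10000L);  /* 10ms */
            state = pg_atomic_read_u32(&desc->state);
        }
    }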

It's probably possible to force backends to unpin buffers they work
with, but it sounds much more problematic to me. What do you think?

#93Thom Brown
thom@linux.com
In reply to: Dmitry Dolgov (#92)
Re: Changing shared_buffers without restart

On Mon, 14 Jul 2025, 09:54 Dmitry Dolgov, <9erthalion6@gmail.com> wrote:

On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote:

You're right of course, a buffer id could be returned from the
ClockSweep and from the custom strategy buffer ring. But from what I see
those are picking a buffer from the set of already utilized buffers,
meaning that for a buffer to land there it first has to go through
StrategyControl->firstFreeBuffer, and hence the idea above will be a
requirement for those as well.

That isn't true. A buffer which was never in the free list can still
be picked up by clock sweep.

How's that?

Isn't it its job to find usable buffers from the used buffer list when no
free ones are available? The next victim buffer can be selected (and
cleaned if dirty) and then immediately used without touching the free list.

Thom

#94Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thom Brown (#93)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 10:24:50AM +0100, Thom Brown wrote:
On Mon, 14 Jul 2025, 09:54 Dmitry Dolgov, <9erthalion6@gmail.com> wrote:

On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote:

You're right of course, a buffer id could be returned from the
ClockSweep and from the custom strategy buffer ring. But from what I see
those are picking a buffer from the set of already utilized buffers,
meaning that for a buffer to land there it first has to go through
StrategyControl->firstFreeBuffer, and hence the idea above will be a
requirement for those as well.

That isn't true. A buffer which was never in the free list can still
be picked up by clock sweep.

How's that?

Isn't it its job to find usable buffers from the used buffer list when no
free ones are available? The next victim buffer can be selected (and
cleaned if dirty) and then immediately used without touching the free list.

Ah, I see what you mean folks. But I'm talking here only about buffers
which will be allocated after extending shared memory -- they must go
through the freelist first (I don't see why not, any other options?),
and clock sweep will have a chance to pick them up only afterwards. That
makes the freelist sort of an entry point for those buffers.

#95Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#94)
Re: Changing shared_buffers without restart

Hi,

On 2025-07-14 11:32:25 +0200, Dmitry Dolgov wrote:

On Mon, Jul 14, 2025 at 10:24:50AM +0100, Thom Brown wrote:
On Mon, 14 Jul 2025, 09:54 Dmitry Dolgov, <9erthalion6@gmail.com> wrote:

On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote:

You're right of course, a buffer id could be returned from the
ClockSweep and from the custom strategy buffer ring. But from what I see
those are picking a buffer from the set of already utilized buffers,
meaning that for a buffer to land there it first has to go through
StrategyControl->firstFreeBuffer, and hence the idea above will be a
requirement for those as well.

That isn't true. A buffer which was never in the free list can still
be picked up by clock sweep.

How's that?

Isn't it its job to find usable buffers from the used buffer list when no
free ones are available? The next victim buffer can be selected (and
cleaned if dirty) and then immediately used without touching the free list.

Ah, I see what you mean folks. But I'm talking here only about buffers
which will be allocated after extending shared memory -- they must go
through the freelist first (I don't see why not, any other options?),
and clock sweep will have a chance to pick them up only afterwards. That
makes the freelist sort of an entry point for those buffers.

Clock sweep can find any buffer, independent of whether it's on the freelist.

#96Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#95)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 08:56:56AM -0400, Andres Freund wrote:

Ah, I see what you mean folks. But I'm talking here only about buffers
which will be allocated after extending shared memory -- they must go
through the freelist first (I don't see why not, any other options?),
and clock sweep will have a chance to pick them up only afterwards. That
makes the freelist sort of an entry point for those buffers.

Clock sweep can find any buffer, independent of whether it's on the freelist.

It does the search based on nextVictimBuffer, where the actual buffer
will be a modulo of NBuffers, right? If that's correct and I get
everything else right, that would mean as long as NBuffers stays the
same (which is the case for the purposes of the current discussion) new
buffers, allocated on top of NBuffers after shared memory increase, will
not be picked by the clock sweep.
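
To illustrate with a standalone toy model (hypothetical names, not the
actual freelist.c code) -- as long as the divisor keeps its old value,
ids added beyond it simply never come out of the sweep:

#include <stdint.h>
#include <stdio.h>

static uint64_t next_victim_counter = 0;   /* stand-in for nextVictimBuffer */

static int
clock_sweep_tick(int n_buffers)
{
    /* the sweep hand only ever wraps within [0, n_buffers) */
    return (int) (next_victim_counter++ % n_buffers);
}

int
main(void)
{
    int n_buffers = 4;                      /* "old" NBuffers */

    for (int i = 0; i < 6; i++)
        printf("victim = %d\n", clock_sweep_tick(n_buffers));

    n_buffers = 8;                          /* only after NBuffers is bumped ... */
    printf("victim = %d\n", clock_sweep_tick(n_buffers)); /* ... can ids 4..7 show up */
    return 0;
}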

#97Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#96)
Re: Changing shared_buffers without restart

Hi,

On 2025-07-14 15:08:28 +0200, Dmitry Dolgov wrote:

On Mon, Jul 14, 2025 at 08:56:56AM -0400, Andres Freund wrote:

Ah, I see what you mean folks. But I'm talking here only about buffers
which will be allocated after extending shared memory -- they must go
through the freelist first (I don't see why not, any other options?),
and clock sweep will have a chance to pick them up only afterwards. That
makes the freelist sort of an entry point for those buffers.

Clock sweep can find any buffer, independent of whether it's on the freelist.

It does the search based on nextVictimBuffer, where the actual buffer
will be a modulo of NBuffers, right? If that's correct and I get
everything else right, that would mean as long as NBuffers stays the
same (which is the case for the purposes of the current discussion) new
buffers, allocated on top of NBuffers after shared memory increase, will
not be picked by the clock sweep.

Are you telling me that you'd put "new" buffers onto the freelist before you
increase NBuffers? That doesn't make sense.

Orthogonally - there's discussion about simply removing the freelist.

Greetings,

Andres Freund

#98Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#97)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 09:14:26AM -0400, Andres Freund wrote:

Clock sweep can find any buffer, independent of whether it's on the freelist.

It does the search based on nextVictimBuffer, where the actual buffer
will be a modulo of NBuffers, right? If that's correct and I get
everything else right, that would mean as long as NBuffers stays the
same (which is the case for the purposes of the current discussion) new
buffers, allocated on top of NBuffers after shared memory increase, will
not be picked by the clock sweep.

Are you telling me that you'd put "new" buffers onto the freelist before you
increase NBuffers? That doesn't make sense.

Why?

Orthogonally - there's discussion about simply removing the freelist.

Good to know, will take a look at that thread, thanks.

#99Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#98)
Re: Changing shared_buffers without restart

Hi,

On 2025-07-14 15:20:03 +0200, Dmitry Dolgov wrote:

On Mon, Jul 14, 2025 at 09:14:26AM -0400, Andres Freund wrote:

Clock sweep can find any buffer, independent of whether it's on the freelist.

It does the search based on nextVictimBuffer, where the actual buffer
will be a modulo of NBuffers, right? If that's correct and I get
everything else right, that would mean as long as NBuffers stays the
same (which is the case for the purposes of the current discussion) new
buffers, allocated on top of NBuffers after shared memory increase, will
not be picked by the clock sweep.

Are you telling me that you'd put "new" buffers onto the freelist before you
increase NBuffers? That doesn't make sense.

Why?

I think it basically boils down to "That's not how it's supposed to work".

If you have buffers that are not in the clock sweep, they'll get unfairly high
usage counts, as their usecount won't be decremented by the clock
sweep, resulting in those buffers potentially being overly sticky after the
s_b resize has completed.

It breaks the entirely reasonable check to verify that a buffer returned by
StrategyGetBuffer() is within the buffer pool.

Obviously, if we remove the freelist, not having the clock sweep find the
buffer would mean it's unreachable.

What on earth would be the point of putting a buffer on the freelist but not
make it reachable by the clock sweep? To me that's just nonsensical.

Greetings,

Andres Freund

#100Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#99)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote:
What on earth would be the point of putting a buffer on the freelist but not
make it reachable by the clock sweep? To me that's just nonsensical.

To clarify, we're not talking about this scenario as "that's how it
would work after the resize". The point is that to expand shared buffers
they need to be initialized, included into the whole buffer machinery
(freelist, clock sweep, etc.) and NBuffers has to be updated. Those
steps are separated in time, and I'm currently trying to understand what
are the consequences of performing them in different order and whether
there are possible concurrency issues under various scenarios. Does this
make more sense, or still not?

#101Burd, Greg
greg@burd.me
In reply to: Dmitry Dolgov (#100)
Re: Changing shared_buffers without restart

On Jul 14, 2025, at 10:01 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote:
What on earth would be the point of putting a buffer on the freelist but not
make it reachable by the clock sweep? To me that's just nonsensical.

To clarify, we're not talking about this scenario as "that's how it
would work after the resize". The point is that to expand shared buffers
they need to be initialized, included into the whole buffer machinery
(freelist, clock sweep, etc.) and NBuffers has to be updated. Those
steps are separated in time, and I'm currently trying to understand what
are the consequences of performing them in different order and whether
there are possible concurrency issues under various scenarios. Does this
make more sense, or still not?

Hello, first off thanks for working on the intricate issues related to resizing
shared_buffers.

Second, I'm new in this code so take that into account, but I'm the person trying
to remove the freelist entirely [1] so I have reviewed this code recently.

I'd initialize them, expand BufferDescriptors, and adjust NBuffers. The
clock-sweep algorithm will eventually find them and make use of them. The
buf->freeNext should be FREENEXT_NOT_IN_LIST so that StrategyFreeBuffer() will
do the work required to append it to the freelist after use. AFAICT there is no
need to add to the freelist up front.
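
To make that concrete, initializing the added descriptors could look roughly
like the loop in InitBufferPool() (just a sketch; first_new and new_nbuffers
are placeholders, and FREENEXT_NOT_IN_LIST is currently private to freelist.c):

    for (int i = first_new; i < new_nbuffers; i++)
    {
        BufferDesc *buf = GetBufferDescriptor(i);

        ClearBufferTag(&buf->tag);
        pg_atomic_init_u32(&buf->state, 0);
        buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
        buf->buf_id = i;

        /* not on the freelist yet; StrategyFreeBuffer() can link it later */
        buf->freeNext = FREENEXT_NOT_IN_LIST;

        LWLockInitialize(BufferDescriptorGetContentLock(buf),
                         LWTRANCHE_BUFFER_CONTENT);
    }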

best.

-greg

[1]: /messages/by-id/flat/E2D6FCDC-BE98-4F95-B45E-699C3E17BA10@burd.me

#102Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#100)
Re: Changing shared_buffers without restart

Hi,

On 2025-07-14 16:01:50 +0200, Dmitry Dolgov wrote:

On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote:
What on earth would be the point of putting a buffer on the freelist but not
make it reachable by the clock sweep? To me that's just nonsensical.

To clarify, we're not talking about this scenario as "that's how it
would work after the resize". The point is that to expand shared buffers
they need to be initialized, included into the whole buffer machinery
(freelist, clock sweep, etc.) and NBuffers has to be updated.

It seems pretty obvious to me that the order has to be

1) initialize buffer headers
2) update NBuffers
3) put them onto the freelist

(with 3) hopefully becoming obsolete)
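
Spelled out as a sketch (the helper names are placeholders, and propagating the
new NBuffers value to every backend is exactly the part under discussion):

void
ExpandBufferPoolSketch(int new_nbuffers)
{
    int first_new = NBuffers;

    /* 1) initialize the headers of the newly mapped buffers */
    InitNewBufferDescriptors(first_new, new_nbuffers);    /* placeholder */

    /* make the initialization visible before the new size is published */
    pg_write_barrier();

    /* 2) publish the new size; the clock sweep can now reach the new ids */
    NBuffers = new_nbuffers;

    /* 3) optionally link them into the freelist (obsolete once it's gone) */
    AppendNewBuffersToFreelist(first_new, new_nbuffers);  /* placeholder */
}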

Those steps are separated in time, and I'm currently trying to understand
what are the consequences of performing them in different order and whether
there are possible concurrency issues under various scenarios. Does this
make more sense, or still not?

I still don't understand why it'd ever make sense to put a buffer onto the
freelist before updating NBuffers first.

Greetings,

Andres Freund

#103Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#102)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote:

Those steps are separated in time, and I'm currently trying to understand
what are the consequences of performing them in different order and whether
there are possible concurrency issues under various scenarios. Does this
make more sense, or still not?

I still don't understand why it'd ever make sense to put a buffer onto the
freelist before updating NBuffers first.

Depending on how NBuffers is updated, different backends may have
different value of NBuffers for a short time frame. In that case a
scenario I'm trying to address is when one backend with the new NBuffers
value allocates a new buffer and puts it into the buffer lookup table,
where it could become reachable by another backend, which still has the
old NBuffer value. Correct me if I'm wrong, but initializing buffer
headers + updating NBuffers means clock sweep can now return one of
those new buffers, opening the scenario above, right?

#104Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Burd, Greg (#101)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 10:22:17AM -0400, Burd, Greg wrote:
I'd initialize them, expand BufferDescriptors, and adjust NBuffers. The
clock-sweep algorithm will eventually find them and make use of them. The
buf->freeNext should be FREENEXT_NOT_IN_LIST so that StrategyFreeBuffer() will
do the work required to append it the freelist after use. AFAICT there is no
need to add to the freelist up front.

Yep, thanks. I think this approach may lead to a problem I'm trying to
address with the buffer lookup table (just have described it in the
message above). But if I'm wrong, that of course would be the way to go.

#105Jack Ng
Jack.Ng@huawei.com
In reply to: Andres Freund (#102)
RE: Changing shared_buffers without restart

If I understand correctly, putting a new buffer in the freelist before updating NBuffers could break existing logic that calls BufferIsValid(bufnum) and asserts bufnum <= NBuffers? (since a backend can grab the new buffer and check its validity before the coordinator can add it to the freelist.)
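
For reference, BufferIsValid() in bufmgr.h looks roughly like this:

static inline bool
BufferIsValid(Buffer bufnum)
{
    Assert(bufnum <= NBuffers);
    Assert(bufnum >= -NLocBuffer);

    return bufnum != InvalidBuffer;
}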

But it seems updating NBuffers before adding new elements to the freelist could be problematic too? Like if a new buffer is already chosen as a victim and then the coordinator adds it to the freelist, would that lead to "double-use"? (seems possible at least with current logic and serialization in StrategyGetBuffer). If that's a valid concern, would something like this work?

1) initialize buffer headers, with a new state/flag to indicate "add-pending"
2) update NBuffers
-- add a check in clock-sweep logic for "add-pending" and skip them
3) put them onto the freelist
4) when a new element is grabbed from freelist, check for and reset add-pending flag.

This ensures the new element is always obtained from the freelist first, I think.
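
As a rough illustration (BM_ADD_PENDING is a made-up flag here, not an
existing buffer state bit), the sweep-side check could be something like:

static bool
BufferIsAddPending(BufferDesc *buf)
{
    uint32  buf_state = LockBufHdr(buf);
    bool    pending = (buf_state & BM_ADD_PENDING) != 0;

    UnlockBufHdr(buf, buf_state);

    /* the clock sweep would simply 'continue' past such buffers */
    return pending;
}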

Jack


-----Original Message-----
From: Andres Freund <andres@anarazel.de>
Sent: Monday, July 14, 2025 10:23 AM
To: Dmitry Dolgov <9erthalion6@gmail.com>
Cc: Thom Brown <thom@linux.com>; Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com>; Tomas Vondra <tomas@vondra.me>;
Thomas Munro <thomas.munro@gmail.com>; PostgreSQL-development <pgsql-
hackers@postgresql.org>; Jack Ng <Jack.Ng@huawei.com>; Ni Ku
<jakkuniku@gmail.com>
Subject: Re: Changing shared_buffers without restart

Hi,

On 2025-07-14 16:01:50 +0200, Dmitry Dolgov wrote:

On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote:
What on earth would be the point of putting a buffer on the freelist
but not make it reachable by the clock sweep? To me that's just nonsensical.

To clarify, we're not talking about this scenario as "that's how it
would work after the resize". The point is that to expand shared
buffers they need to be initialized, included into the whole buffer
machinery (freelist, clock sweep, etc.) and NBuffers has to be updated.

It seems pretty obvious to me that the order has to be

1) initialize buffer headers
2) update NBuffers
3) put them onto the freelist

(with 3) hopefully becoming obsolete)

Those steps are separated in time, and I'm currently trying to
understand what are the consequences of performing them in different
order and whether there are possible concurrency issues under various
scenarios. Does this make more sense, or still not?

I still don't understand why it'd ever make sense to put a buffer onto the freelist
before updating NBuffers first.

Greetings,

Andres Freund

#106Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#103)
Re: Changing shared_buffers without restart

Hi,

On July 14, 2025 10:39:33 AM EDT, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote:

Those steps are separated in time, and I'm currently trying to understand
what are the consequences of performing them in different order and whether
there are possible concurrency issues under various scenarios. Does this
make more sense, or still not?

I still don't understand why it'd ever make sense to put a buffer onto the
freelist before updating NBuffers first.

Depending on how NBuffers is updated, different backends may have
different value of NBuffers for a short time frame. In that case a
scenario I'm trying to address is when one backend with the new NBuffers
value allocates a new buffer and puts it into the buffer lookup table,
where it could become reachable by another backend, which still has the
old NBuffer value. Correct me if I'm wrong, but initializing buffer
headers + updating NBuffers means clock sweep can now return one of
those new buffers, opening the scenario above, right?

The same is true if you put buffers into the freelist.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#107Jack Ng
Jack.Ng@huawei.com
In reply to: Andres Freund (#106)
RE: Changing shared_buffers without restart

Just brain-storming here... would moving NBuffers to shared memory solve this specific issue? Though I'm pretty sure that would open up a new set of synchronization issues elsewhere, so I'm not sure if there's a net gain.

Jack


-----Original Message-----
From: Andres Freund <andres@anarazel.de>
Sent: Monday, July 14, 2025 11:12 AM
To: Dmitry Dolgov <9erthalion6@gmail.com>
Cc: Thom Brown <thom@linux.com>; Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com>; Tomas Vondra <tomas@vondra.me>;
Thomas Munro <thomas.munro@gmail.com>; PostgreSQL-development <pgsql-
hackers@postgresql.org>; Jack Ng <Jack.Ng@huawei.com>; Ni Ku
<jakkuniku@gmail.com>
Subject: Re: Changing shared_buffers without restart

Hi,

On July 14, 2025 10:39:33 AM EDT, Dmitry Dolgov <9erthalion6@gmail.com>
wrote:

On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote:

Those steps are separated in time, and I'm currently trying to
understand what are the consequences of performing them in
different order and whether there are possible concurrency issues
under various scenarios. Does this make more sense, or still not?

I still don't understand why it'd ever make sense to put a buffer
onto the freelist before updating NBuffers first.

Depending on how NBuffers is updated, different backends may have
different value of NBuffers for a short time frame. In that case a
scenario I'm trying to address is when one backend with the new
NBuffers value allocates a new buffer and puts it into the buffer
lookup table, where it could become reachable by another backend, which
still has the old NBuffer value. Correct me if I'm wrong, but
initializing buffer headers + updating NBuffers means clock sweep can
now return one of those new buffers, opening the scenario above, right?

The same is true if you put buffers into the freelist.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#108Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#106)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 11:11:36AM -0400, Andres Freund wrote:
Hi,

On July 14, 2025 10:39:33 AM EDT, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote:

Those steps are separated in time, and I'm currently trying to understand
what are the consequences of performing them in different order and whether
there are possible concurrency issues under various scenarios. Does this
make more sense, or still not?

I still don't understand why it'd ever make sense to put a buffer onto the
freelist before updating NBuffers first.

Depending on how NBuffers is updated, different backends may have
different value of NBuffers for a short time frame. In that case a
scenario I'm trying to address is when one backend with the new NBuffers
value allocates a new buffer and puts it into the buffer lookup table,
where it could become reachable by another backend, which still has the
old NBuffer value. Correct me if I'm wrong, but initializing buffer
headers + updating NBuffers means clock sweep can now return one of
those new buffers, opening the scenario above, right?

The same is true if you put buffers into the freelist.

Yep, but the question about clock sweep still stands. Anyway, thanks for
the input, let me digest it and come up with more questions & patch
series.

#109Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Jack Ng (#107)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 03:18:10PM +0000, Jack Ng wrote:
Just brain-storming here... would moving NBuffers to shared memory solve this specific issue? Though I'm pretty sure that would open up a new set of synchronization issues elsewhere, so I'm not sure if there's a net gain.

It's in fact already happening, there is a shared structure that
describes the resize status. But if I get everything right, it doesn't
solve all the problems.

#110Jim Nasby
jnasby@upgrade.com
In reply to: Dmitry Dolgov (#79)
Re: Changing shared_buffers without restart

On Fri, Jul 4, 2025 at 9:42 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Fri, Jul 04, 2025 at 02:06:16AM +0200, Tomas Vondra wrote:

...

10) what to do about stuck resize?

AFAICS the resize can get stuck for various reasons, e.g. because it
can't evict pinned buffers, possibly indefinitely. Not great, it's not
clear to me if there's a way out (canceling the resize) after a timeout,
or something like that? Not great to start an "online resize" only to
get stuck with all activity blocked for indefinite amount of time, and
get to restart anyway.

Seems related to Thomas' message [2], but AFAICS the patch does not do
anything about this yet, right? What's the plan here?

It's another open discussion right now, with an idea to eventually allow
canceling after a timeout. I think canceling when stuck on buffer
eviction should be pretty straightforward (the eviction must take place
before the actual shared memory resize, so we know nothing has changed yet),
but in some other failure scenarios it would be harder (e.g. if one
backend is stuck resizing, while others have succeeded -- this would
require another round of synchronization and some way to figure out what
the current status is).

From a user standpoint, I would expect any kind of resize like this to be
an online operation that happens in the background. If this is driven by a
GUC I don't see how it could be anything else, but if something else is
decided on I think it'd just be a pain to require a session to stay connected
until a resize was complete. (Of course we'd need to provide some means of
monitoring a resize that was in-process, perhaps via a pg_stat_progress
view or a system function.)

Also, while I haven't fully followed discussion about how to synchronize
backends, I will say that I don't think it's at all unreasonable if a
resize doesn't take full effect until every backend has at minimum ended
any running transaction, or potentially even returned back to the
equivalent of `PostgresMain()` for that type of backend. Obviously it'd be
nicer to be more responsive than that, but I don't think the first version
of the feature has to accomplish that.

For that matter, I also feel it'd be fine if the first version didn't even
support shrinking shared buffers.

Finally, while shared buffers is the most visible target here, there are
other shared memory settings that have a *much* smaller surface area, and
in my experience are going to be much more valuable from a tuning
perspective; notably wal_buffers and the MXID SLRUs (and possibly CLOG and
subtrans). I say that because unless you're running a workload that
entirely fits in shared buffers, or a *really* small shared buffers
compared to system memory, increasing shared buffers quickly gets into
diminishing returns. But since the default size for the other fixed sized
areas is so much smaller than normal values for shared_buffers, increasing
those areas can have a much, much larger impact on performance. (Especially
for something like the MXID SLRUs.) I would certainly consider focusing on
one of those areas before trying to tackle shared buffers.

#111Jack Ng
Jack.Ng@huawei.com
In reply to: Dmitry Dolgov (#109)
RE: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 03:18:10PM +0000, Jack Ng wrote:
Just brain-storming here... would moving NBuffers to shared memory solve

this specific issue? Though I'm pretty sure that would open up a new set of
synchronization issues elsewhere, so I'm not sure if there's a net gain.

It's in fact already happening, there is a shared structure that described the
resize status. But if I get everything right, it doesn't solve all the problems.

Hi Dmitry,

Just to clarify, you're not only referring to the ShmemControl::NSharedBuffers
and related logic in the current patches, but actually getting rid of per-process
NBuffers completely and use ShmemControl::NSharedBuffers everywhere instead (or
something along those lines)? So that when the coordinator updates
ShmemControl::NSharedBuffers, everyone sees the new value right away.
I guess this is part of the "simplified design" you mentioned several posts earlier?

I also thought about that approach more, and there seems to be new synchronization
issues we would need to deal with, like:

1. Mid-execution change of NBuffers in functions like BufferSync and BgBufferSync,
which could cause correctness and performance issues. I suppose most of them
are solvable with atomics and shared r/w locks etc, but at the cost of higher
performance overheads.

2. NBuffers becomes inconsistent with the underlying shared memory mappings for a
period of time for each process. Currently both are updated in AnonymousShmemResize
and AdjustShmemSize "atomically" for a process, so I wonder if letting them get
out-of-sync (even for a brief period) could be problematic.

I agree it doesn't seem to solve all the problems. It can simplify certain aspects
of the design, but may also introduce new issues. Overall not a "silver bullet" :)

Jack

#112Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Jack Ng (#111)
Re: Changing shared_buffers without restart

On Tue, Jul 15, 2025 at 10:52:01PM +0000, Jack Ng wrote:

On Mon, Jul 14, 2025 at 03:18:10PM +0000, Jack Ng wrote:
Just brain-storming here... would moving NBuffers to shared memory solve

this specific issue? Though I'm pretty sure that would open up a new set of
synchronization issues elsewhere, so I'm not sure if there's a net gain.

It's in fact already happening, there is a shared structure that described the
resize status. But if I get everything right, it doesn't solve all the problems.

Just to clarify, you're not only referring to the ShmemControl::NSharedBuffers
and related logic in the current patches, but actually getting rid of per-process
NBuffers completely and use ShmemControl::NSharedBuffers everywhere instead (or
something along those lines)? So that when the coordinator updates
ShmemControl::NSharedBuffers, everyone sees the new value right away.
I guess this is part of the "simplified design" you mentioned several posts earlier?

I was thinking more about something like NBuffersAvailable, which would
control how victim buffers are getting picked, but there is a spectrum
of different options to experiment with.
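
As a rough sketch of that (NBuffersAvailable is purely hypothetical here,
living next to the other fields in the shared resize-control struct):

static int
ClockSweepTickSketch(void)
{
    uint32  n_avail = pg_atomic_read_u32(&ShmemCtrl->NBuffersAvailable);
    uint32  counter = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

    /* victim selection bounded by the shared, atomically published value
     * rather than the per-backend NBuffers */
    return (int) (counter % n_avail);
}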

I also thought about that approach more, and there seems to be new synchronization
issues we would need to deal with, like:

A potentially tricky change of NBuffers already happens in the current
patch set, e.g. NBuffers is getting updated in ProcessProcSignalBarrier,
which is called at the end of a BufferSync loop iteration. By itself I
don't see any obvious problems here except remembering the buffer id in
CkptBufferIds (I've mentioned this a few messages above).

#113Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Jim Nasby (#110)
Re: Changing shared_buffers without restart

On Mon, Jul 14, 2025 at 05:55:13PM -0500, Jim Nasby wrote:

Finally, while shared buffers is the most visible target here, there are
other shared memory settings that have a *much* smaller surface area, and
in my experience are going to be much more valuable from a tuning
perspective; notably wal_buffers and the MXID SLRUs (and possibly CLOG and
subtrans). I say that because unless you're running a workload that
entirely fits in shared buffers, or a *really* small shared buffers
compared to system memory, increasing shared buffers quickly gets into
diminishing returns. But since the default size for the other fixed sized
areas is so much smaller than normal values for shared_buffers, increasing
those areas can have a much, much larger impact on performance. (Especially
for something like the MXID SLRUs.) I would certainly consider focusing on
one of those areas before trying to tackle shared buffers.

That's an interesting idea, thanks for sharing. The reason I'm
concentrating on shared buffers is that it was frequently called out as
a problem when trying to tune PostgreSQL automatically. In this context
shared buffers is usually one of the most impactful knobs, yet one of
the most painful to manage as well. But if the amount of complexity
around resizable shared buffers proves insurmountable, yeah, it
would make sense to consider simpler targets using the same mechanism.

#114Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#110)
Re: Changing shared_buffers without restart

Hi,

On 2025-07-14 17:55:13 -0500, Jim Nasby wrote:

I say that because unless you're running a workload that entirely fits in
shared buffers, or a *really* small shared buffers compared to system
memory, increasing shared buffers quickly gets into diminishing returns.

I don't think that's true, at all, today. And it certainly won't be true in a
world where we will be able to use direct_io for real workloads.

Particularly for write heavy workloads, the difference between a small buffer
pool and a large one can be *dramatic*, because the large buffer pool allows
most writes to be done by checkpointer (and thus largely sequentially) rather
than by backends and bgwriter (and thus largely randomly). Doing more writes
sequentially helps with short-term performance, but *particularly* helps with
sustained performance on SSDs. A larger buffer pool also reduces the *total*
number of writes dramatically, because the same buffer will often be dirtied
repeatedly within one checkpoint window.

r/w pgbench is a workload that *undersells* the benefit of a larger
shared_buffers, as each transaction is uncommonly small, making WAL flushes
much more of a bottleneck (the access pattern is too uniform, too). But even
for that the difference can be massive:

A scale 500 pgbench with 48 clients:

s_b=512MB:
  averages 390MB/s of writes in steady state
  average TPS: 25072

s_b=8192MB:
  averages 48MB/s of writes in steady state
  average TPS: 47901

Nearly an order of magnitude difference in writes and nearly a 2x difference
in TPS.

25%, the advice we give for shared_buffers, is literally close to the worst
possible value. The only thing it maximizes is double buffering, while
removing useful information about what to cache and for how long from both
postgres and the OS, leading to reduced cache hit rates.

But since the default size for the other fixed sized areas is so much
smaller than normal values for shared_buffers, increasing those areas can
have a much, much larger impact on performance. (Especially for something
like the MXID SLRUs.) I would certainly consider focusing on one of those
areas before trying to tackle shared buffers.

I think that'd be a bad idea. There's simply no point in having the complexity
in place to allow for dynamically resizing a few megabytes of buffers. You
just configure them large enough (including probably increasing some of the
defaults one of these years). Whereas you can't just do that for
shared_buffers, as we're talking real memory. Ahead of time you do not know
how much memory backends themselves need and the amount of memory in the
system may change.

Resizing shared_buffers is particularly important because it's becoming more
important to be able to dynamically increase/decrease the resources of a
running postgres instance to adjust for system load. Memory and CPUs can be
hot added/removed from VMs, but we need to utilize them...

Greetings,

Andres Freund

#115Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Tomas Vondra (#78)
Re: Changing shared_buffers without restart

Hi Tomas,
Thanks for your detailed feedback. Sorry for replying late.

On Fri, Jul 4, 2025 at 5:36 AM Tomas Vondra <tomas@vondra.me> wrote:

v5-0008-Support-shrinking-shared-buffers.patch

1) Why is ShmemCtrl->evictor_pid reset in AnonymousShmemResize? Isn't
there a place starting it and waiting for it to complete? Why
shouldn't it do EvictExtraBuffers itself?

3) Seems a bit strange to do it from a random backend. Shouldn't it
be the responsibility of a process like checkpointer/bgwriter, or
maybe a dedicated dynamic bgworker? Can we even rely on a backend
to be available?

I will answer these two together. I don't think we should rely on a
random backend. But that's what the rest of the patches did and
patches to support shrinking followed them. But AFAIK, Dmitry is
working on a set of changes which will make a non-postmaster backend
to be a coordinator for buffer pool resizing process. When that
happens the same backend which initializes the expanded memory when
expanding the buffer pool should also be responsible for evicting the
buffers when shrinking the buffer pool. Will wait for Dmitry's next
set of patches before making this change.

4) Unsolved issues with buffers pinned for a long time. Could be an
issue if the buffer is pinned indefinitely (e.g. cursor in idle
connection), and the resizing blocks some activity (new connections
or stuff like that).

In such cases we should cancel the operation or kill that backend (per
user preference) after a timeout with (user specified) timeout >= 0.
We haven't yet figured out the details. I think the first version of
the feature would just cancel the operation, if it encounters a pinned
buffer.

2) Isn't the change to BufferManagerShmemInit wrong? How do we know the
last buffer is still at the end of the freelist? Seems unlikely.
6) It's not clear to me in what situations this triggers (in the call
to BufferManagerShmemInit)

if (FirstBufferToInit < NBuffers) ...

Will answer these two together. As the comment says FirstBufferToInit
< NBuffers indicates two situations: When FirstBufferToInit = 0, it's
the first time the buffer pool is being initialized. Otherwise it
indicates expanding the buffer pool, in which case the last buffer
will be a newly initialized buffer. All newly initialized buffers are
linked into the freelist one after the other in the increasing order
of their buffer ids by code a few lines above. Now that the free
buffer list has been removed, we don't need to worry about it. In the
next set of patches, I have removed this code.

v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch

1) IMHO this should be included in the earlier resize/shrink patches,
I don't see a reason to keep it separate (assuming this is the
correct way, and the "init" is not).

These patches are separate just because Dmitry and I developed them
separately. Once they are reviewed by Dmitry, we will squash them
into a single patch. I am expecting that Dmitry's next patchset which
will do significant changes to the synchronization will have a single
patch for all code related to and consequential to resizing.

5) Funny that "AI suggests" something, but doesn't the block fail to
reset nextVictimBuffer of the clocksweep? It may point to a buffer
we're removing, and it'll be invalid, no?

The TODO no longer applies. There's code to reset the clocksweep in a
separate patch. Sorry for not removing it earlier. It will be removed
in the next set of patches.

2) Doesn't StrategyPurgeFreeList already do some of this for the case
of shrinking memory?

3) Not great adding a bunch of static variables to bufmgr.c. Why do we
need to make "everything" static global? Isn't it enough to make
only the "valid" flag global? The rest can stay local, no?

If everything needs to be global for some reason, could we at least
make it a struct, to group the fields, not just separate random
variables? And maybe at the top, not half-way through the file?

4) Isn't the name BgBufferSyncAdjust misleading? It's not adjusting
anything, it's just invalidating the info about past runs.

I think there's a bit of refactoring possible here. Setting up the
BgBufferSync state, resetting it when bgwriter_lru_maxpages <= 0 and
then re-initializing it when bgwriter_lru_maxpages > 0, and actually
performing the buffer sync is all packed into the same function
BgBufferSync() right now. It makes this function harder to read. I
think these functionalities should be separated into their own
functions and use the appropriate one instead of BgBufferSyncAdjust(),
whose name is misleading. The static global variables should all be
packed into a structure which is passed as an argument to these
functions. I need more time to study the code and refactor it that
way. For now I have added a note to the commit message of this patch
so that I will revisit it. I have renamed BgBufferSyncAdjust() to
BgBufferSyncReset().
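
For what it's worth, the grouping could be as simple as collecting the
existing statics into one struct, roughly:

typedef struct BgBufferSyncState
{
    bool    saved_info_valid;      /* is the data below valid? */
    int     prev_strategy_buf_id;  /* clock-sweep position at the previous run */
    uint32  prev_strategy_passes;
    int     next_to_clean;         /* bgwriter's own scan position */
    uint32  next_passes;
    float   smoothed_alloc;        /* moving averages maintained across runs */
    float   smoothed_density;
} BgBufferSyncState;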

5) I don't quite understand why BufferSync needs to do the dance with
delay_shmem_resize. I mean, we certainly should not run BufferSync
from the code that resizes buffers, right? Certainly not after the
eviction, from the part that actually rebuilds shmem structs etc.

Right. But let me answer all three questions together.

So perhaps something could trigger resize while we're running the
BufferSync()? Isn't that a bit strange? If this flag is needed, it
seems more like a band-aid for some issue in the architecture.

6) Also, why should it be fine to get into situation that some of the
buffers might not be valid, during shrinking? I mean, why should
this check (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers).
It seems better to ensure we never get into "sync" in a way that
might lead some of the buffers invalid. Seems way too lowlevel to
care about whether resize is happening.

7) I don't understand the new condition for "Execute the LRU scan".
Won't this stop LRU scan even in cases when we want it to happen?
Don't we want to scan the buffers in the remaining part (after
shrinking), for example? Also, we already checked this shmem flag at
the beginning of the function - sure, it could change (if some other
process modifies it), but does that make sense? Wouldn't it cause
problems if it can change at an arbitrary point while running the
BufferSync? IMHO just another sign it may not make sense to allow
this, i.e. buffer sync should not run during the "actual" resize.

ProcessBarrierShmemResize() which does the resizing is part of
ProcessProcSignalBarrier() which in turn gets called from
CHECK_FOR_INTERRUPTS(), which is called from multiple places, even
from elog(). I am not able to find a call stack linking BgBufferSync()
and ProcessProcSignalBarrier(). But I couldn't convince myself that it
is true and will remain true in the future. I mean, the function loops
through a large number of buffers and performs IO, both avenues to
call CHECK_FOR_INTERRUPTS(). Hence that flag. Do you know what (part
of code) guarantees that ProcessProcSignalBarrier() will never be
called from BgBufferSync()?

Note, resizing cannot begin till delay_shmem_resize is cleared, so
while BgBufferSync is executing, no buffer can be invalidated nor can
new buffers be added. But that comes at the cost of all other backends
waiting till BgBufferSync finishes. We want to avoid that. The idea here
is to make BgBufferSync stop as soon as it realises that the buffer
resizing is "about to begin". But I think the condition looks wrong. I
think the right condition would be NBufferPending != NBuffers or
NBuffersOld. AFAIK, Dmitry is working on consolidating NBuffers*
variables as you have requested elsewhere. Better even if we could
somehow set a flag in shared memory indicating that the buffer
resizing is "about to begin" and BgBufferSync() checks that flag. So I
will wait for him to make that change and then change this condition.
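
Something along these lines, checked at the top of each scan iteration
(ResizeInProgress is a hypothetical flag, just to illustrate the shape of it):

    /* hypothetical early exit at the top of the BgBufferSync LRU scan */
    if (pg_atomic_read_u32(&ShmemCtrl->ResizeInProgress) != 0)
        return true;    /* stop scanning, like the bgwriter_lru_maxpages <= 0 case */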

v5-0010-Additional-validation-for-buffer-in-the-ring.patch

1) So the problem is we might create a ring before shrinking shared
buffers, and then GetBufferFromRing will see bogus buffers? OK, but
we should be more careful with these checks, otherwise we'll miss
real issues when we incorrectly get an invalid buffer. Can't the
backends do this only when they for sure know we did shrink the
shared buffers? Or maybe even handle that during the barrier?

2) IMHO a sign there's the "transitions" between different NBuffers
values may not be clear enough, and we're allowing stuff to happen
in the "blurry" area. I think that's likely to cause bugs (it did
cause issues for the online checksums patch, I think).

I think you are right, that this might hide some bugs. Just like we
remove buffers to be shrunk from the freelist only once, I wanted each
backend to remove them from the buffer rings only once. But I couldn't find a
way to make all the buffer rings for a given backend available to the
barrier handling code. The rings are stored in Scan objects, which
seem local to the executor nodes. Is there a way to make them
available to barrier handling code (even if it has to walk an
execution tree, let's say)?

If there would have been only one scan, we could have set a flag after
shrinking, let GetBufferFromRing() purge all invalid buffers once when
flag is true and reset the flag. But there can be more than one scan
happening and we don't know how many there are and when all of them
have finished calling GetBufferFromRing() after shrinking. Suggestions
to do this only once are welcome.

I will send the next set of patches with my next email.

--
Best Wishes,
Ashutosh Bapat

#116Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Ashutosh Bapat (#74)
16 attachment(s)
Re: Changing shared_buffers without restart

On Mon, Jun 16, 2025 at 6:09 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Buffer lookup table resizing
------------------------------------

I looked at the interaction of the shared buffer lookup table with buffer
resizing as per the patches in [0]. Here's a list of my findings, issues
and fixes.

1. The basic structure of buffer lookup table (directory and control
area etc.) is allocated in a shared memory segment dedicated to the
buffer lookup table. However, the entries are allocated in the shared
memory using ShmemAllocNoError() which allocates the entries in the
main memory segment. In order for ShmemAllocNoError() to allocate
entries in the dedicated shared memory segment, it should know the
shared memory segment. We could do that by setting the segment number
in element_alloc() before calling hashp->alloc(). This is similar to
how ShmemAllocNoError() knows the memory context in which to allocate
the entries on heap. But read on ...

2. When the buffer pool is expanded, an "out of shared memory" error
is thrown when more entries are added to the buffer lookup table. We
could temporarily adjust that flag and allocate more entries. But the
directory also needs to be expanded proportionately otherwise it may
lead to more contention. Expanding directory is non-trivial since it's
a contiguous chunk of memory, followed by other data structures.
Further, expanding directory would require rehashing all the existing
entries, which may impact the time taken by the resizing operation and
how long other backends remain blocked.

3. When the buffer pool is shrunk, there is no way to free the extra
entries in such a way that a contiguous chunk of shared memory can be
given back to the OS. In case we implement it, we will need some way
to compact the shrunk entries in contiguous chunk of memory and unmap
remaining chunk. That's some significant code.

Given these things, I think we should set up the buffer lookup table
to hold the maximum number of entries required to expand the buffer pool to its
maximum, right at the beginning. The maximum size to which the buffer pool
can grow is given by the GUC max_available_memory (which is a misnomer and
should be renamed to max_shared_buffers or something), introduced by the
previous set of patches [0]. We don't shrink or expand the buffer
lookup table as we shrink and expand the buffer pool. With that the
buffer lookup table can be located in the main memory segment itself
and we don't have to fix ShmemAllocNoError().
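
Concretely, the setup would then be something like the following, done once at
startup (sketch; MaxSharedBuffers stands in for whatever the max-size GUC
translates to in buffers):

    /* size the shared buffer lookup table once, for the largest possible pool */
    InitBufTable(MaxSharedBuffers + NUM_BUFFER_PARTITIONS);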

This has two side effects:
1. A larger hash table makes hash table operations slower [2]. Its
impact on actual queries needs to be studied.
2. There's increase in the total shared memory allocated upfront.
Currently we allocate 150MB memory with all default GUC values. With
this change we will allocate 250MB memory since max_available_memory
(or rather max_shared_buffers) defaults to allow 524288 shared
buffers. If we make max_shared_buffers default to shared_buffers,
it won't be a problem. However, when a user sets max_shared_buffers
themselves, they have to be conscious of the fact that it will
allocate more memory than necessary with given shared_buffers value.

This fix is part of patch 0015.

The patchset contains more fixes and improvements as described below.

Per TODO in the prologue of CalculateShmemSize(), more than necessary
shared memory was mapped and allocated in the buffer manager related
memory segments because of an error in that function; the amount of
memory to be allocated in the main shared memory segment was added to
every other shared memory segment. Thus shrinking those memory
segments didn't actually affect the objects allocated in those.
Because of that, we were not seeing SIGBUS even when the objects
supposedly shrunk were accessed, masking bugs in the patches. In this
patchset I have a working fix for CalculateShmemSize(). With that fix
in place we see server crashing with SIGBUS in some resizing
operations. Those cases need to be investigated. The fix changes its
minions to a. return size of shared memory objects to be allocated in
the main memory segment and b. add sizes of the shared memory objects
to be allocated in other memory segments in the respective
AnonymousMapping structures. This asymmetry between the main segment and
the other segments exists so as not to change the minions of
CalculateShmemSize() too much. But I think we should eliminate the asymmetry
and change every minion to add sizes in the respective segment's
AnonymousMapping structure. The patch proposed at [3] would simplify
CalculateShmemSize(), which should help eliminate the asymmetry.
Along with refactoring CalculateShmemSize() I have added small fixes
to update the total size and end address of shared memory mapping
after resizing them and also to update the new allocated_sizes of
resized structures in ShmemIndex entry. Patch 0009 includes these
changes.

I found that the shared memory resizing synchronization is triggered
even before setting up the shared buffers the first time after
starting the server. That's not required and also can lead to issues
because of trying to resize shared buffers which do not exist. A WIP
fix is included as patch 0012. A TODO in the patch needs to be
addressed. It should be squashed into an earlier patch 0011 when
appropriate.

While debugging the above mentioned issues, I found it useful to have
an insight into the contents of buffer lookup table. Hence I added a
system view exposing the contents of the buffer lookup table. This is
added as patch 0001 in the attached patchset. I think it's useful to
have this independent of this patchset to investigate inconsistencies
between the contents of shared buffer pool and buffer lookup table.

Again for debugging purposes, I have added a new column "segment" in
pg_shmem_allocations reporting the shared memory segment in which the
given allocation has happened. I have also added another view
pg_shmem_segments to provide information about the shared memory
segments. This view definition will change as we design shared memory
mappings and shared memory segments better. So it's WIP and needs doc
changes as well. I have included it in the patchset as patch 0011
since it will be helpful to debug issues found in the patch when
testing. The patch should be merged into patch 0007.

Last but not least, patch 0016 contains two tests: a. a stress test
to run buffer resizing while pgbench is running, b. a SQL test to check
the sizes of segments and shared memory allocations after resizing.
The stress test polls "show shared_buffers" output to know when the
resizing is finished. I think we need a better interface to know when
resizing has finished. Thanks a lot to my colleague Palak Chaturvedi for
providing the initial draft of the test case.

The patches are rebased on top of the latest master, which includes
changes to remove free buffer list. That led to removing all the code
in these patches dealing with free buffer list.

I am intentionally keeping my changes (patches 0001, 0008 to 0012,
0012 to 0016) separate from Dmitry's changes so that Dmitry can review
them easily. The patches are arranged so that my patches are nearer to
Dmitry's patches, into which, they should be squashed.

Dmitry,
I found that max_available_memory is PGC_SIGHUP. Is that intentional?
I thought it's PGC_POSTMASTER since we can not reserve more address
space without restarting postmaster. Left a TODO for this. I think we
also need to change the name and description to better reflect its
actual functionality.

[0]: /messages/by-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2
[1]: /messages/by-id/CAExHW5v0jh3F_wj86yC=qBfWk0uiT94qy=Z41uzAHLHh0SerRA@mail.gmail.com
[2]: https://ashutoshpg.blogspot.com/2025/07/efficiency-of-sparse-hash-table.html
[3]: https://commitfest.postgresql.org/patch/5997/

--
Best Wishes,
Ashutosh Bapat

Attachments:

0004-Introduce-pss_barrierReceivedGeneration-20250918.patch (application/x-patch)
From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier allows making sure that the message sent
via EmitProcSignalBarrier was processed by all ProcSignal mechanism
participants.

Add pss_barrierReceivedGeneration alongside pss_barrierGeneration,
which will be updated when a process has received the message, but not
processed it yet. This makes it possible to support a new mode of
waiting, when ProcSignal participants want to synchronize message
processing. To do that, a participant can wait via
WaitForProcSignalBarrierReceived when processing a message, effectively
making sure that all processes are going to start processing
ProcSignalBarrier simultaneously.
---
 src/backend/storage/ipc/procsignal.c | 67 ++++++++++++++++++++++------
 src/include/storage/procsignal.h     |  1 +
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 087821311cc..eb3ceaae809 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -58,7 +58,10 @@
  * of it. For such use cases, we set a bit in pss_barrierCheckMask and then
  * increment the current "barrier generation"; when the new barrier generation
  * (or greater) appears in the pss_barrierGeneration flag of every process,
- * we know that the message has been received everywhere.
+ * we know that the message has been received and processed everywhere. In case
+ * we need to know only that the message was received everywhere (e.g. because
+ * receiving processes need to handle the message in a coordinated fashion),
+ * use pss_barrierReceivedGeneration in the same way.
  */
 typedef struct
 {
@@ -70,6 +73,7 @@ typedef struct
 
 	/* Barrier-related fields (not protected by pss_mutex) */
 	pg_atomic_uint64 pss_barrierGeneration;
+	pg_atomic_uint64 pss_barrierReceivedGeneration;
 	pg_atomic_uint32 pss_barrierCheckMask;
 	ConditionVariable pss_barrierCV;
 } ProcSignalSlot;
@@ -152,6 +156,8 @@ ProcSignalShmemInit(void)
 			slot->pss_cancel_key_len = 0;
 			MemSet(slot->pss_signalFlags, 0, sizeof(slot->pss_signalFlags));
 			pg_atomic_init_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+			pg_atomic_init_u64(&slot->pss_barrierReceivedGeneration,
+							   PG_UINT64_MAX);
 			pg_atomic_init_u32(&slot->pss_barrierCheckMask, 0);
 			ConditionVariableInit(&slot->pss_barrierCV);
 		}
@@ -199,6 +205,8 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	barrier_generation =
 		pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration,
+						barrier_generation);
 
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
@@ -263,6 +271,7 @@ CleanupProcSignalState(int status, Datum arg)
 	 * no barrier waits block on it.
 	 */
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration, PG_UINT64_MAX);
 
 	SpinLockRelease(&slot->pss_mutex);
 
@@ -416,12 +425,8 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
 	return generation;
 }
 
-/*
- * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
- * requested by a specific call to EmitProcSignalBarrier() have taken effect.
- */
-void
-WaitForProcSignalBarrier(uint64 generation)
+static void
+WaitForProcSignalBarrierInternal(uint64 generation, bool receivedOnly)
 {
 	Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));
 
@@ -436,12 +441,17 @@ WaitForProcSignalBarrier(uint64 generation)
 		uint64		oldval;
 
 		/*
-		 * It's important that we check only pss_barrierGeneration here and
-		 * not pss_barrierCheckMask. Bits in pss_barrierCheckMask get cleared
-		 * before the barrier is actually absorbed, but pss_barrierGeneration
+		 * It's important that we check only pss_barrierGeneration &
+		 * pss_barrierReceivedGeneration here and not pss_barrierCheckMask. Bits in
+		 * pss_barrierCheckMask get cleared before the barrier is actually
+		 * absorbed, but pss_barrierGeneration & pss_barrierReceivedGeneration
 		 * is updated only afterward.
 		 */
-		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+		if (receivedOnly)
+			oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+		else
+			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
 		while (oldval < generation)
 		{
 			if (ConditionVariableTimedSleep(&slot->pss_barrierCV,
@@ -450,7 +460,11 @@ WaitForProcSignalBarrier(uint64 generation)
 				ereport(LOG,
 						(errmsg("still waiting for backend with PID %d to accept ProcSignalBarrier",
 								(int) pg_atomic_read_u32(&slot->pss_pid))));
-			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
+			if (receivedOnly)
+				oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+			else
+				oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		}
 		ConditionVariableCancelSleep();
 	}
@@ -464,12 +478,33 @@ WaitForProcSignalBarrier(uint64 generation)
 	 * The caller is probably calling this function because it wants to read
 	 * the shared state or perform further writes to shared state once all
 	 * backends are known to have absorbed the barrier. However, the read of
-	 * pss_barrierGeneration was performed unlocked; insert a memory barrier
-	 * to separate it from whatever follows.
+	 * pss_barrierGeneration & pss_barrierReceivedGeneration was performed
+	 * unlocked; insert a memory barrier to separate it from whatever follows.
 	 */
 	pg_memory_barrier();
 }
 
+/*
+ * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
+ * requested by a specific call to EmitProcSignalBarrier() have taken effect.
+ */
+void
+WaitForProcSignalBarrier(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, false);
+}
+
+/*
+ * WaitForProcSignalBarrierReceived - wait until it is guaranteed that all
+ * backends have observed the message sent by a specific call to
+ * EmitProcSignalBarrier().
+ */
+void
+WaitForProcSignalBarrierReceived(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, true);
+}
+
 /*
  * Handle receipt of an interrupt indicating a global barrier event.
  *
@@ -523,6 +558,10 @@ ProcessProcSignalBarrier(void)
 	if (local_gen == shared_gen)
 		return;
 
+	/* The message is observed, record that */
+	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
+						shared_gen);
+
 	/*
 	 * Get and clear the flags that are set for this backend. Note that
 	 * pg_atomic_exchange_u32 is a full barrier, so we're guaranteed that the
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index afeeb1ca019..2733bbb8c5b 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -79,6 +79,7 @@ extern void SendCancelRequest(int backendPID, const uint8 *cancel_key, int cance
 
 extern uint64 EmitProcSignalBarrier(ProcSignalBarrierType type);
 extern void WaitForProcSignalBarrier(uint64 generation);
+extern void WaitForProcSignalBarrierReceived(uint64 generation);
 extern void ProcessProcSignalBarrier(void);
 
 extern void procsignal_sigusr1_handler(SIGNAL_ARGS);
-- 
2.34.1

0002-Process-config-reload-in-AIO-workers-20250918.patch (application/x-patch)
From d1ed934ccd02fca2c831e582b07a169e17d19f59 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 15:14:33 +0200
Subject: [PATCH 02/16] Process config reload in AIO workers

Currently AIO workers process interrupts only via CHECK_FOR_INTERRUPTS,
which does not include ConfigReloadPending. Thus we need to check for it
explicitly.
---
 src/backend/storage/aio/method_worker.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index b5ac073a910..d1c6da89c4b 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -80,6 +80,7 @@ static void pgaio_worker_shmem_init(bool first_time);
 static bool pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh);
 static int	pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
 
+static void pgaio_worker_process_interrupts(void);
 
 const IoMethodOps pgaio_worker_ops = {
 	.shmem_size = pgaio_worker_shmem_size,
@@ -463,6 +464,8 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		int			nwakeups = 0;
 		int			worker;
 
+		pgaio_worker_process_interrupts();
+
 		/*
 		 * Try to get a job to do.
 		 *
@@ -592,3 +595,25 @@ pgaio_workers_enabled(void)
 {
 	return io_method == IOMETHOD_WORKER;
 }
+
+/*
+ * Process any new interrupts.
+ */
+static void
+pgaio_worker_process_interrupts(void)
+{
+	/*
+	 * Reloading config can trigger further signals, complicating interrupts
+	 * processing -- so let it run first.
+	 *
+	 * XXX: Is there any need for a memory barrier after ProcessConfigFile?
+	 */
+	if (ConfigReloadPending)
+	{
+		ConfigReloadPending = false;
+		ProcessConfigFile(PGC_SIGHUP);
+	}
+
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+}
-- 
2.34.1

0003-Introduce-pending-flag-for-GUC-assign-hooks-20250918.patch (application/x-patch)
From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinated work
between backends to change, meaning we cannot apply the new value right away.

Add a new flag "pending" to allow an assign hook to indicate exactly that.
If the pending flag is set after the hook, the new value will not be
applied, and its handling becomes the responsibility of the hook's
implementation.

Note that this also requires changes in the way GUCs are reported, but the
patch does not cover that yet.
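
To illustrate the intended use, here is a minimal sketch of an assign hook
that defers applying its value (hypothetical code, not part of this patch;
the real shared_buffers hook arrives in a later patch):

    void
    assign_shared_buffers(int newval, void *extra, bool *pending)
    {
        if (NBuffers == newval)
            return;

        /*
         * Applying the new value requires coordination between backends, so
         * ask the GUC machinery to skip the assignment; handling the value
         * becomes the responsibility of the hook's implementation.
         */
        *pending = true;
    }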
---
 src/backend/access/transam/xlog.c    |  2 +-
 src/backend/commands/variable.c      |  6 +--
 src/backend/libpq/pqcomm.c           |  8 ++--
 src/backend/tcop/postgres.c          |  2 +-
 src/backend/utils/misc/guc.c         | 59 +++++++++++++++++++---------
 src/backend/utils/misc/stack_depth.c |  2 +-
 src/include/utils/guc.h              |  2 +-
 src/include/utils/guc_hooks.h        | 20 +++++-----
 8 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0baf0ac6160..307ac31b19e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2197,7 +2197,7 @@ CalculateCheckpointSegments(void)
 }
 
 void
-assign_max_wal_size(int newval, void *extra)
+assign_max_wal_size(int newval, void *extra, bool *pending)
 {
 	max_wal_size_mb = newval;
 	CalculateCheckpointSegments();
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 608f10d9412..e40dae2ddf2 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1143,7 +1143,7 @@ check_cluster_name(char **newval, void **extra, GucSource source)
  * GUC assign_hook for maintenance_io_concurrency
  */
 void
-assign_maintenance_io_concurrency(int newval, void *extra)
+assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
 {
 	/*
 	 * Reconfigure recovery prefetching, because a setting it depends on
@@ -1161,12 +1161,12 @@ assign_maintenance_io_concurrency(int newval, void *extra)
  * they may be assigned in either order.
  */
 void
-assign_io_max_combine_limit(int newval, void *extra)
+assign_io_max_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit = Min(newval, io_combine_limit_guc);
 }
 void
-assign_io_combine_limit(int newval, void *extra)
+assign_io_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit = Min(io_max_combine_limit, newval);
 }
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 25f739a6a17..1726a7c0993 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1951,7 +1951,7 @@ pq_settcpusertimeout(int timeout, Port *port)
  * GUC assign_hook for tcp_keepalives_idle
  */
 void
-assign_tcp_keepalives_idle(int newval, void *extra)
+assign_tcp_keepalives_idle(int newval, void *extra, bool *pending)
 {
 	/*
 	 * The kernel API provides no way to test a value without setting it; and
@@ -1984,7 +1984,7 @@ show_tcp_keepalives_idle(void)
  * GUC assign_hook for tcp_keepalives_interval
  */
 void
-assign_tcp_keepalives_interval(int newval, void *extra)
+assign_tcp_keepalives_interval(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivesinterval(newval, MyProcPort);
@@ -2007,7 +2007,7 @@ show_tcp_keepalives_interval(void)
  * GUC assign_hook for tcp_keepalives_count
  */
 void
-assign_tcp_keepalives_count(int newval, void *extra)
+assign_tcp_keepalives_count(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivescount(newval, MyProcPort);
@@ -2030,7 +2030,7 @@ show_tcp_keepalives_count(void)
  * GUC assign_hook for tcp_user_timeout
  */
 void
-assign_tcp_user_timeout(int newval, void *extra)
+assign_tcp_user_timeout(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_settcpusertimeout(newval, MyProcPort);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d356830f756..8d4d6cc3f33 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3596,7 +3596,7 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 
 /* GUC assign hook for transaction_timeout */
 void
-assign_transaction_timeout(int newval, void *extra)
+assign_transaction_timeout(int newval, void *extra, bool *pending)
 {
 	if (IsTransactionState())
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 46fdefebe35..0d5e523aaf0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1680,6 +1680,7 @@ InitializeOneGUCOption(struct config_generic *gconf)
 				struct config_int *conf = (struct config_int *) gconf;
 				int			newval = conf->boot_val;
 				void	   *extra = NULL;
+				bool 	   pending = false;
 
 				Assert(newval >= conf->min);
 				Assert(newval <= conf->max);
@@ -1688,9 +1689,13 @@ InitializeOneGUCOption(struct config_generic *gconf)
 					elog(FATAL, "failed to initialize %s to %d",
 						 conf->gen.name, newval);
 				if (conf->assign_hook)
-					conf->assign_hook(newval, extra);
-				*conf->variable = conf->reset_val = newval;
-				conf->gen.extra = conf->reset_extra = extra;
+					conf->assign_hook(newval, extra, &pending);
+
+				if (!pending)
+				{
+					*conf->variable = conf->reset_val = newval;
+					conf->gen.extra = conf->reset_extra = extra;
+				}
 				break;
 			}
 		case PGC_REAL:
@@ -2046,13 +2051,18 @@ ResetAllOptions(void)
 			case PGC_INT:
 				{
 					struct config_int *conf = (struct config_int *) gconf;
+					bool 			  pending = false;
 
 					if (conf->assign_hook)
 						conf->assign_hook(conf->reset_val,
-										  conf->reset_extra);
-					*conf->variable = conf->reset_val;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									conf->reset_extra);
+										  conf->reset_extra,
+										  &pending);
+					if (!pending)
+					{
+						*conf->variable = conf->reset_val;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										conf->reset_extra);
+					}
 					break;
 				}
 			case PGC_REAL:
@@ -2429,16 +2439,21 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
 							struct config_int *conf = (struct config_int *) gconf;
 							int			newval = newvalue.val.intval;
 							void	   *newextra = newvalue.extra;
+							bool 	    pending = false;
 
 							if (*conf->variable != newval ||
 								conf->gen.extra != newextra)
 							{
 								if (conf->assign_hook)
-									conf->assign_hook(newval, newextra);
-								*conf->variable = newval;
-								set_extra_field(&conf->gen, &conf->gen.extra,
-												newextra);
-								changed = true;
+									conf->assign_hook(newval, newextra, &pending);
+
+								if (!pending)
+								{
+									*conf->variable = newval;
+									set_extra_field(&conf->gen, &conf->gen.extra,
+													newextra);
+									changed = true;
+								}
 							}
 							break;
 						}
@@ -3855,18 +3870,24 @@ set_config_with_handle(const char *name, config_handle *handle,
 
 				if (changeVal)
 				{
+					bool pending = false;
+
 					/* Save old value to support transaction abort */
 					if (!makeDefault)
 						push_old_value(&conf->gen, action);
 
 					if (conf->assign_hook)
-						conf->assign_hook(newval, newextra);
-					*conf->variable = newval;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									newextra);
-					set_guc_source(&conf->gen, source);
-					conf->gen.scontext = context;
-					conf->gen.srole = srole;
+						conf->assign_hook(newval, newextra, &pending);
+
+					if (!pending)
+					{
+						*conf->variable = newval;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										newextra);
+						set_guc_source(&conf->gen, source);
+						conf->gen.scontext = context;
+						conf->gen.srole = srole;
+					}
 				}
 				if (makeDefault)
 				{
diff --git a/src/backend/utils/misc/stack_depth.c b/src/backend/utils/misc/stack_depth.c
index 8f7cf531fbc..ef59ae62008 100644
--- a/src/backend/utils/misc/stack_depth.c
+++ b/src/backend/utils/misc/stack_depth.c
@@ -156,7 +156,7 @@ check_max_stack_depth(int *newval, void **extra, GucSource source)
 
 /* GUC assign hook for max_stack_depth */
 void
-assign_max_stack_depth(int newval, void *extra)
+assign_max_stack_depth(int newval, void *extra, bool *pending)
 {
 	ssize_t		newval_bytes = newval * (ssize_t) 1024;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f21ec37da89..c3056cd2da8 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -187,7 +187,7 @@ typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource sourc
 typedef bool (*GucEnumCheckHook) (int *newval, void **extra, GucSource source);
 
 typedef void (*GucBoolAssignHook) (bool newval, void *extra);
-typedef void (*GucIntAssignHook) (int newval, void *extra);
+typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);
 typedef void (*GucRealAssignHook) (double newval, void *extra);
 typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 typedef void (*GucEnumAssignHook) (int newval, void *extra);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 82ac8646a8d..658c799419e 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -81,12 +81,12 @@ extern bool check_log_stats(bool *newval, void **extra, GucSource source);
 extern bool check_log_timezone(char **newval, void **extra, GucSource source);
 extern void assign_log_timezone(const char *newval, void *extra);
 extern const char *show_log_timezone(void);
-extern void assign_maintenance_io_concurrency(int newval, void *extra);
-extern void assign_io_max_combine_limit(int newval, void *extra);
-extern void assign_io_combine_limit(int newval, void *extra);
-extern void assign_max_wal_size(int newval, void *extra);
+extern void assign_maintenance_io_concurrency(int newval, void *extra, bool *pending);
+extern void assign_io_max_combine_limit(int newval, void *extra, bool *pending);
+extern void assign_io_combine_limit(int newval, void *extra, bool *pending);
+extern void assign_max_wal_size(int newval, void *extra, bool *pending);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
-extern void assign_max_stack_depth(int newval, void *extra);
+extern void assign_max_stack_depth(int newval, void *extra, bool *pending);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
@@ -141,13 +141,13 @@ extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 extern void assign_syslog_facility(int newval, void *extra);
 extern void assign_syslog_ident(const char *newval, void *extra);
-extern void assign_tcp_keepalives_count(int newval, void *extra);
+extern void assign_tcp_keepalives_count(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_count(void);
-extern void assign_tcp_keepalives_idle(int newval, void *extra);
+extern void assign_tcp_keepalives_idle(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_idle(void);
-extern void assign_tcp_keepalives_interval(int newval, void *extra);
+extern void assign_tcp_keepalives_interval(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_interval(void);
-extern void assign_tcp_user_timeout(int newval, void *extra);
+extern void assign_tcp_user_timeout(int newval, void *extra, bool *pending);
 extern const char *show_tcp_user_timeout(void);
 extern bool check_temp_buffers(int *newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra,
@@ -163,7 +163,7 @@ extern bool check_transaction_buffers(int *newval, void **extra, GucSource sourc
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
 extern bool check_transaction_read_only(bool *newval, void **extra, GucSource source);
-extern void assign_transaction_timeout(int newval, void *extra);
+extern void assign_transaction_timeout(int newval, void *extra, bool *pending);
 extern const char *show_unix_socket_permissions(void);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern bool check_wal_consistency_checking(char **newval, void **extra,
-- 
2.34.1

0005-Allow-to-use-multiple-shared-memory-mapping-20250918.patch (application/x-patch)
From 63fe27340656c52b13f4eecebd9e73d24efe5e33 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 28 Feb 2025 19:54:47 +0100
Subject: [PATCH 05/16] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways the shared memory can be organized.

Introduce the possibility to allocate multiple shared memory mappings, where
a single mapping is associated with a specified shared memory segment. There
is only a fixed number of available segments; currently only the main shared
memory segment is allocated. A new shared memory API is introduced, extended
with a segment as a new parameter. As the path of least resistance, the
original API is kept in place, utilizing the main shared memory segment.
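
As a usage sketch (a hypothetical caller, not code from this patch), a
subsystem that keeps its state in the main segment can keep calling the old
API, which simply forwards to the segment-aware variant; the two calls below
are therefore interchangeable:

    bool    found;
    void   *state;

    /* Original API: allocates from the main segment, as before. */
    state = ShmemInitStruct("My Subsystem State", 1024, &found);

    /* The same allocation through the new segment-aware API. */
    state = ShmemInitStructInSegment("My Subsystem State", 1024, &found,
                                     MAIN_SHMEM_SEGMENT);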
---
 src/backend/port/posix_sema.c     |   4 +-
 src/backend/port/sysv_sema.c      |   4 +-
 src/backend/port/sysv_shmem.c     | 138 +++++++++++++++++++---------
 src/backend/port/win32_sema.c     |   2 +-
 src/backend/storage/ipc/ipc.c     |   4 +-
 src/backend/storage/ipc/ipci.c    |  63 +++++++------
 src/backend/storage/ipc/shmem.c   | 148 +++++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c |  15 ++-
 src/include/storage/ipc.h         |   2 +-
 src/include/storage/pg_sema.h     |   2 +-
 src/include/storage/pg_shmem.h    |  18 ++++
 src/include/storage/shmem.h       |  11 +++
 12 files changed, 283 insertions(+), 128 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..401e1113fa1 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 #endif
 
 	numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 6ac83ea1a82..7bb363989c4 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -327,7 +327,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -348,7 +348,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 	numSharedSemas = 0;
 	maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..56af0231d24 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+	int shmem_segment;
+	Size shmem_size; 			/* Size of the mapping */
+	Pointer shmem; 				/* Pointer to the start of the mapped memory */
+	Pointer seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping segments */
+static int next_free_segment = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_segment)
+{
+	switch (shmem_segment)
+	{
+		case MAIN_SHMEM_SEGMENT:
+			return "main";
+		default:
+			return "unknown";
+	}
+}
+
+static void
+DebugMappings()
+{
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		AnonymousMapping m = Mappings[i];
+		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+			 MappingName(i), m.shmem, m.shmem_size);
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify the mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-	Size		allocsize = *size;
+	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
 				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		{
+			DebugMappings();
+			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
+				 MappingName(mapping->shmem_segment), allocsize);
+		}
 	}
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = *size;
+		allocsize = mapping->shmem_size;
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
 	if (ptr == MAP_FAILED)
 	{
 		errno = mmap_errno;
+		DebugMappings();
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
-	*size = allocsize;
-	return ptr;
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		AnonymousMapping m = Mappings[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	mapping->shmem_size = size;
+	mapping->shmem_segment = next_free_segment;
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping);
+
+		next_free_segment++;
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + next_free_segment;
 
 	for (;;)
 	{
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	mapping->seg_addr = memAddress;
+	mapping->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (mapping->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		AnonymousMapping m = Mappings[i];
+
+		if (m.seg_addr != NULL)
+		{
+			if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+			m.seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index 5854ad1f54d..e7365ff8060 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 2704e80b3a7..1965b2d3eb4 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -61,6 +61,8 @@ static void proc_exit_prepare(int code);
  * but provide some additional features we need --- in particular,
  * we want to register callbacks to invoke when we are disconnecting
  * from a broken shared-memory context but not exiting the postmaster.
+ * The maximum number of such exit callbacks depends on the number of
+ * shared segments.
  *
  * Callback functions can take zero, one, or two args: the first passed
  * arg is the integer exitcode, the second is the Datum supplied when
@@ -68,7 +70,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..8b38e985327 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -86,7 +86,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_segment)
 {
 	Size		size;
 	int			numSemas;
@@ -206,33 +206,38 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize(&numSemas);
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
-
-	/*
-	 * Create semaphores
-	 */
-	PGReserveSemaphores(numSemas);
-
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		/* Compute the size of the shared-memory block */
+		size = CalculateShmemSize(&numSemas, segment);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+		/*
+		 * Create the shmem segment.
+		 *
+		 * XXX: Are multiple shims needed, one per segment?
+		 */
+		seghdr = PGSharedMemoryCreate(size, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSegment(seghdr, segment);
+
+		/*
+		 * Create semaphores
+		 */
+		PGReserveSemaphores(numSemas, segment);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSegment(segment);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -363,7 +368,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas);
+	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index a0770e86796..f185ed28f95 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -76,19 +76,19 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
+								 int shmem_segment);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
+ShmemSegment Segments[ANON_MAPPINGS];
 
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/*
+ * Primary index hashtable for shmem; for simplicity we use a single one for
+ * all shared memory segments. There can be performance consequences of that,
+ * and an alternative would be to have one index per shared memory segment.
+ */
+static HTAB *ShmemIndex = NULL;
 
 /* To get reliable results for NUMA inquiry we need to "touch pages" once */
 static bool firstNumaTouch = true;
@@ -101,9 +101,17 @@ Datum		pg_numa_available(PG_FUNCTION_ARGS);
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-	ShmemSegHdr = seghdr;
-	ShmemBase = seghdr;
-	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+	InitShmemAccessInSegment(seghdr, MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAccessInSegment(PGShmemHeader *seghdr, int shmem_segment)
+{
+	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	ShmemSegment *seg = &Segments[shmem_segment];
+	seg->ShmemSegHdr = shmhdr;
+	seg->ShmemBase = (void *) shmhdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -114,7 +122,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSegment(MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAllocationInSegment(int shmem_segment)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_segment].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -123,9 +137,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_segment].ShmemLock = (slock_t *) ShmemAllocUnlockedInSegment(sizeof(slock_t), shmem_segment);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_segment].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -150,11 +164,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocInSegment(Size size, int shmem_segment)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -184,6 +204,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSegment(size, allocated_size, MAIN_SHMEM_SEGMENT);
+}
+
+static void *
+ShmemAllocRawInSegment(Size size, Size *allocated_size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -203,22 +229,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= Segments[shmem_segment].ShmemSegHdr->totalsize)
 	{
-		newSpace = (char *) ShmemBase + newStart;
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
+		Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -236,6 +262,12 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+	return ShmemAllocUnlockedInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -246,19 +278,19 @@ ShmemAllocUnlocked(Size size)
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory (%zu bytes requested)",
 						size)));
-	ShmemSegHdr->freeoffset = newFree;
+	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (char *) ShmemBase + newStart;
+	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -273,7 +305,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSegment(addr, MAIN_SHMEM_SEGMENT);
+}
+
+bool
+ShmemAddrIsValidInSegment(const void *addr, int shmem_segment)
+{
+	return (addr >= Segments[shmem_segment].ShmemBase) && (addr < Segments[shmem_segment].ShmemEnd);
 }
 
 /*
@@ -334,6 +372,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  int64 max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSegment(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SEGMENT);
+}
+
+HTAB *
+ShmemInitHashInSegment(const char *name,		/* table string name for shmem index */
+			  long init_size,		/* initial table size */
+			  long max_size,		/* max size of the table */
+			  HASHCTL *infoP,		/* info about key and bucket size */
+			  int hash_flags,		/* info about infoP */
+			  int shmem_segment) 	/* in which segment to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -350,9 +400,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSegment(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_segment);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -385,6 +435,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSegment(name, size, foundPtr, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
+					  int shmem_segment)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -393,7 +450,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_segment].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -416,7 +473,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSegment(size, shmem_segment);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -433,8 +490,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in segment %d",
+						name, shmem_segment)));
 	}
 
 	if (*foundPtr)
@@ -459,7 +516,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -478,14 +535,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -542,10 +598,11 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
+	/* XXX: take all shared memory segments into account. */
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
 		values[2] = Int64GetDatum(ent->size);
 		values[3] = Int64GetDatum(ent->allocated_size);
 		named_allocated += ent->allocated_size;
@@ -557,15 +614,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	/* output shared memory allocated but not counted via the shmem index */
 	values[0] = CStringGetTextDatum("<anonymous>");
 	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 	/* output as-of-yet unused shared memory */
 	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
@@ -630,7 +687,12 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 	 * this is not very likely, and moreover we have more entries, each of
 	 * them using only fraction of the total pages.
 	 */
-	shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		PGShmemHeader *shmhdr = Segments[segment].ShmemSegHdr;
+		shm_total_page_count += (shmhdr->totalsize / os_page_size) + 1;
+	}
+
 	page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
 	pages_status = palloc(sizeof(int) * shm_total_page_count);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46c82c63ca5..93792a83af9 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -80,6 +80,8 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
+#include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/procnumber.h"
@@ -594,12 +596,15 @@ LWLockNewTrancheId(const char *name)
 	/*
 	 * We use the ShmemLock spinlock to protect LWLockCounter and
 	 * LWLockTrancheNames.
+	 * 
+	 * XXX: Looks like this is the only use of Segments outside of shmem.c,
+	 * it's maybe worth it to reshape this part to hide Segments structure.
 	 */
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	if (*LWLockCounter - LWTRANCHE_FIRST_USER_DEFINED >= MAX_NAMED_TRANCHES)
 	{
-		SpinLockRelease(ShmemLock);
+		SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 		ereport(ERROR,
 				(errmsg("maximum number of tranches already registered"),
 				 errdetail("No more than %d tranches may be registered.",
@@ -610,7 +615,7 @@ LWLockNewTrancheId(const char *name)
 	LocalLWLockCounter = *LWLockCounter;
 	strlcpy(LWLockTrancheNames[result - LWTRANCHE_FIRST_USER_DEFINED], name, NAMEDATALEN);
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	return result;
 }
@@ -732,9 +737,9 @@ GetLWTrancheName(uint16 trancheId)
 	 */
 	if (trancheId >= LocalLWLockCounter)
 	{
-		SpinLockAcquire(ShmemLock);
+		SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 		LocalLWLockCounter = *LWLockCounter;
-		SpinLockRelease(ShmemLock);
+		SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 		if (trancheId >= LocalLWLockCounter)
 			elog(ERROR, "tranche %d is not registered", trancheId);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 3baf418b3d1..6ebda479ced 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index fa6ca35a51f..8ae9637fcd0 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_segment);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 5f7d4b83a60..2348c59b5a0 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+} ShmemSegment;
+
+/* Number of available segments for anonymous memory mappings */
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -91,4 +106,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main segment contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SEGMENT 0
+
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index cd683a9d2d9..910c43f54f4 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -30,15 +30,26 @@ extern PGDLLIMPORT slock_t *ShmemLock;
 typedef struct PGShmemHeader PGShmemHeader; /* avoid including
 											 * storage/pg_shmem.h here */
 extern void InitShmemAccess(PGShmemHeader *seghdr);
+extern void InitShmemAccessInSegment(struct PGShmemHeader *seghdr,
+									 int shmem_segment);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSegment(int shmem_segment);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSegment(Size size, int shmem_segment);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSegment(Size size, int shmem_segment);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSegment(const void *addr, int shmem_segment);
 extern void InitShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, int64 init_size, int64 max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
+									long max_size, HASHCTL *infoP,
+									int hash_flags, int shmem_segment);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSegment(const char *name, Size size,
+									  bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 
-- 
2.34.1

0001-Add-system-view-for-shared-buffer-lookup-ta-20250918.patch (application/x-patch)
From cc90e3f74fe4a14ba95e11664ac68b8daa3ba056 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 25 Aug 2025 19:23:50 +0530
Subject: [PATCH 01/16] Add system view for shared buffer lookup table

The view exposes the contents of the shared buffer lookup table for
debugging, testing and investigation.

TODO:

It would be better to place this view in pg_buffercache, but it's added as a
system view since BufHashTable is not exposed outside buf_table.c. To
move it to pg_buffercache, we should move the function
pg_get_buffer_lookup_table() to pg_buffercache which invokes
BufTableGetContent() by passing it the tuple store and tuple descriptor.
BufTableGetContent fills the tuple store. The partitions are locked by
pg_get_buffer_lookup_table().

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
---
 doc/src/sgml/system-views.sgml         | 89 ++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql   |  7 ++
 src/backend/storage/buffer/buf_table.c | 61 ++++++++++++++++++
 src/include/catalog/pg_proc.dat        | 11 ++++
 src/test/regress/expected/rules.out    |  7 ++
 5 files changed, 175 insertions(+)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4187191ea74..89be9bc333f 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -71,6 +71,11 @@
       <entry>backend memory contexts</entry>
      </row>
 
+     <row>
+      <entry><link linkend="view-pg-buffer-lookup-table"><structname>pg_buffer_lookup_table</structname></link></entry>
+      <entry>shared buffer lookup table</entry>
+     </row>
+
      <row>
       <entry><link linkend="view-pg-config"><structname>pg_config</structname></link></entry>
       <entry>compile-time configuration parameters</entry>
@@ -896,6 +901,90 @@ AND c1.path[c2.level] = c2.path[c2.level];
   </para>
  </sect1>
 
+ <sect1 id="view-pg-buffer-lookup-table">
+  <title><structname>pg_buffer_lookup_table</structname></title>
+  <indexterm>
+   <primary>pg_buffer_lookup_table</primary>
+  </indexterm>
+  <para>
+   The <structname>pg_buffer_lookup_table</structname> view exposes the current
+   contents of the shared buffer lookup table. Each row represents an entry in
+   the lookup table mapping a relation page to the ID of the buffer in which it is
+   cached. The shared buffer lookup table is locked for a short duration while
+   reading so as to ensure consistency. This may affect performance if this view
+   is queried very frequently.
+  </para>
+  <table id="pg-buffer-lookup-table-view" xreflabel="pg_buffer_lookup_table">
+   <title><structname>pg_buffer_lookup_table</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>tablespace</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the tablespace containing the relation
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>database</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the database containing the relation (zero for shared relations)
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>relfilenode</structfield> <type>oid</type>
+      </para>
+      <para>
+       relfilenode identifying the relation
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>forknum</structfield> <type>int2</type>
+      </para>
+      <para>
+       Fork number within the relation (see <xref linkend="storage-file-layout"/>)
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>blocknum</structfield> <type>int8</type>
+      </para>
+      <para>
+       Block number within the relation
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>bufferid</structfield> <type>int4</type>
+      </para>
+      <para>
+       ID of the buffer caching the page 
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+  <para>
+   Access to this view is restricted to members of the
+   <literal>pg_read_all_stats</literal> role by default.
+  </para>
+ </sect1>
+
  <sect1 id="view-pg-config">
   <title><structname>pg_config</structname></title>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c77fa0234bb..46fc28396de 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1420,3 +1420,10 @@ REVOKE ALL ON pg_aios FROM PUBLIC;
 GRANT SELECT ON pg_aios TO pg_read_all_stats;
 REVOKE EXECUTE ON FUNCTION pg_get_aios() FROM PUBLIC;
 GRANT EXECUTE ON FUNCTION pg_get_aios() TO pg_read_all_stats;
+
+CREATE VIEW pg_buffer_lookup_table AS
+    SELECT * FROM pg_get_buffer_lookup_table();
+REVOKE ALL ON pg_buffer_lookup_table FROM PUBLIC;
+GRANT SELECT ON pg_buffer_lookup_table TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_buffer_lookup_table() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_buffer_lookup_table() TO pg_read_all_stats;
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 9d256559bab..1f6e215a2ca 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -21,7 +21,12 @@
  */
 #include "postgres.h"
 
+#include "fmgr.h"
+#include "funcapi.h"
 #include "storage/buf_internals.h"
+#include "storage/lwlock.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
 
 /* entry for buffer lookup hashtable */
 typedef struct
@@ -159,3 +164,59 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 	if (!result)				/* shouldn't happen */
 		elog(ERROR, "shared buffer hash table corrupted");
 }
+
+/*
+ * SQL callable function to report contents of the shared buffer lookup table.
+ */
+Datum
+pg_get_buffer_lookup_table(PG_FUNCTION_ARGS)
+{
+#define PG_GET_BUFFER_LOOKUP_TABLE_COLS 6
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	HASH_SEQ_STATUS hstat;
+	BufferLookupEnt *ent;
+	Datum		values[PG_GET_BUFFER_LOOKUP_TABLE_COLS];
+	bool		nulls[PG_GET_BUFFER_LOOKUP_TABLE_COLS];
+	int			i;
+
+	memset(nulls, 0, sizeof(nulls));
+
+	/*
+	 * We put all the tuples into a tuplestore in one scan of the hashtable.
+	 * This avoids any issue of the hashtable possibly changing between calls.
+	 */
+	InitMaterializedSRF(fcinfo, 0);
+
+	Assert(rsinfo->setDesc->natts == PG_GET_BUFFER_LOOKUP_TABLE_COLS);
+
+	/*
+	 * Lock all buffer mapping partitions to ensure a consistent view of the
+	 * hash table during the scan. Must grab LWLocks in partition-number order
+	 * to avoid LWLock deadlock.
+	 */
+	for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
+		LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);
+
+	hash_seq_init(&hstat, SharedBufHash);
+	while ((ent = (BufferLookupEnt *) hash_seq_search(&hstat)) != NULL)
+	{
+		values[0] = ObjectIdGetDatum(ent->key.spcOid);
+		values[1] = ObjectIdGetDatum(ent->key.dbOid);
+		values[2] = ObjectIdGetDatum(ent->key.relNumber);
+		values[3] = ObjectIdGetDatum(ent->key.forkNum);
+		values[4] = UInt32GetDatum(ent->key.blockNum);
+		values[5] = Int32GetDatum(ent->id);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+
+	/*
+	 * Release all buffer mapping partition locks in the reverse order so as
+	 * to avoid LWLock deadlock.
+	 */
+	for (i = NUM_BUFFER_PARTITIONS - 1; i >= 0; i--)
+		LWLockRelease(BufMappingPartitionLockByIndex(i));
+
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 03e82d28c87..1e53b7a4ae5 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8592,6 +8592,17 @@
   proargmodes => '{o,o,o}', proargnames => '{name,type,size}',
   prosrc => 'pg_get_dsm_registry_allocations' },
 
+# buffer lookup table
+{ oid => '5102',
+  descr => 'shared buffer lookup table',
+  proname => 'pg_get_buffer_lookup_table', prorows => '6', proretset => 't',
+  provolatile => 'v', prorettype => 'record',
+  proargtypes => '', proallargtypes => '{oid,oid,oid,int2,int8,int4}',
+  proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{tablespace,database,relfilenode,forknum,blocknum,bufferid}',
+  prosrc => 'pg_get_buffer_lookup_table'
+},
+
 # memory context of local backend
 { oid => '2282',
   descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 35e8aad7701..760bb13fe95 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1330,6 +1330,13 @@ pg_backend_memory_contexts| SELECT name,
     free_chunks,
     used_bytes
    FROM pg_get_backend_memory_contexts() pg_get_backend_memory_contexts(name, ident, type, level, path, total_bytes, total_nblocks, free_bytes, free_chunks, used_bytes);
+pg_buffer_lookup_table| SELECT tablespace,
+    database,
+    relfilenode,
+    forknum,
+    blocknum,
+    bufferid
+   FROM pg_get_buffer_lookup_table() pg_get_buffer_lookup_table(tablespace, database, relfilenode, forknum, blocknum, bufferid);
 pg_config| SELECT name,
     setting
    FROM pg_config() pg_config(name, setting);

base-commit: 2e66cae935c2e0f7ce9bab6b65ddeb7806f4de7c
-- 
2.34.1

0006-Address-space-reservation-for-shared-memory-20250918.patch (application/x-patch)
From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:47:04 +0200
Subject: [PATCH 06/16] Address space reservation for shared memory

Currently the shared memory layout is designed to pack everything tightly
together, leaving no space between mappings for resizing. Here is how it
looks for one mapping in /proc/$PID/maps; /dev/zero represents the
anonymous shared memory in question:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
    ...

Make the layout more dynamic by splitting every shared memory segment
into two parts:

* An anonymous file, which actually contains the shared memory content. Such
  an anonymous file is created via memfd_create; it lives in memory,
  behaves like a regular file, and is semantically equivalent to
  anonymous memory allocated via mmap with MAP_ANONYMOUS.

* A reservation mapping, whose size is much larger than the required shared
  segment size. This mapping is created with the flags PROT_NONE (which
  makes sure the reserved space is not used) and MAP_NORESERVE (so the
  reserved space is not counted against memory limits). The anonymous file
  is mapped into this reservation mapping.

The resulting layout looks like this:

    00400000-00490000         /path/bin/postgres
    ...
    3f526000-3f590000 rw-p 		[heap]
    7fbd827fe000-7fbd8bdde000 rw-s 	/memfd:main (deleted) -- anon file
    7fbd8bdde000-7fbe82800000 ---s 	/memfd:main (deleted) -- reservation
    7fbe82800000-7fbe90670000 r--p 	/usr/lib/locale/locale-archive
    7fbe90800000-7fbe90941000 r-xp 	/usr/lib64/libstdc++.so.6.0.34

To resize a shared memory segment in this layout it's possible to use ftruncate
on the anonymous file, adjusting access permissions on the reserved space as
needed.
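
As a rough sketch of this mechanism (illustrative only, Linux-specific, error
handling omitted; segment_size, reserved_size and new_size are placeholders
rather than the patch's actual variables):

    /* Anonymous in-memory file that holds the shared memory content. */
    int    fd = memfd_create("main", 0);
    ftruncate(fd, segment_size);

    /* Reserve a much larger range backed by the same file: PROT_NONE keeps
     * it inaccessible, MAP_NORESERVE keeps it out of memory accounting. */
    char  *base = mmap(NULL, reserved_size, PROT_NONE,
                       MAP_SHARED | MAP_NORESERVE, fd, 0);

    /* Open up only the part that is actually in use. */
    mprotect(base, segment_size, PROT_READ | PROT_WRITE);

    /* To resize later: grow the file, then widen the accessible window. */
    ftruncate(fd, new_size);
    mprotect(base, new_size, PROT_READ | PROT_WRITE);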

This approach also does not impact the actual memory usage as reported by
the kernel. Here is the output of /proc/$PID/status for the master
version with shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages
    // mapped in mm_struct. It corresponds to the mapped reserved space
    // and is the only number that grows with it.
    VmPeak:          2043192 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             22908 kB
    // Size of resident anonymous memory
    RssAnon:             768 kB
    // Size of resident file mappings
    RssFile:           10364 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          11776 kB

Here is the same for the patch when reserving 20GB of space:

    VmPeak:         21255824 kB
    VmRSS:             25020 kB
    RssAnon:             768 kB
    RssFile:           10812 kB
    RssShmem:          13440 kB

Cgroup v2 doesn't have any problems with this either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16.6 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    20770816 (~19.8 MB)

To control the amount of reserved space, a new GUC max_available_memory
is introduced. Ideally it should be based on the maximum available
memory, hence the name.
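
Since the GUC is declared with GUC_UNIT_BLOCKS, it accepts memory units in
postgresql.conf; purely as an illustration (these values are hypothetical,
not defaults from the patch):

    # reserve address space so shared_buffers can later be raised up to 16GB
    max_available_memory = '16GB'
    shared_buffers = '2GB'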

There are also a few unrelated advantages of using anon files:

* We've got a file descriptor, which could be used for regular file
  operations (modification, truncation, you name it).

* The file could be given a name, which improves readability when it
  comes to process maps.

* By default, Linux will not add file-backed shared mappings into a core dump,
  making it more convenient to work with them in PostgreSQL: no more huge dumps
  to process.

The downside is that memfd_create is Linux specific.
---
 src/backend/port/sysv_shmem.c             | 290 ++++++++++++++++++----
 src/backend/port/win32_shmem.c            |   2 +-
 src/backend/storage/ipc/ipci.c            |   5 +-
 src/backend/storage/ipc/shmem.c           |   2 +-
 src/backend/utils/init/globals.c          |   1 +
 src/backend/utils/misc/guc_parameters.dat |  12 +
 src/include/miscadmin.h                   |   1 +
 src/include/portability/mem.h             |   2 +-
 src/include/storage/pg_shmem.h            |   5 +-
 9 files changed, 260 insertions(+), 60 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 56af0231d24..363ddfd1fca 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -97,10 +97,12 @@ void	   *UsedShmemSegAddr = NULL;
 typedef struct AnonymousMapping
 {
 	int shmem_segment;
-	Size shmem_size; 			/* Size of the mapping */
+	Size shmem_size; 			/* Size of the actually used memory */
+	Size shmem_reserved; 		/* Size of the reserved mapping */
 	Pointer shmem; 				/* Pointer to the start of the mapped memory */
 	Pointer seg_addr; 			/* SysV shared memory for the header */
 	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -108,6 +110,49 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
+/*
+ * Anonymous mapping layout we use looks like this:
+ *
+ * 00400000-00c2a000 r-xp 			/bin/postgres
+ * ...
+ * 3f526000-3f590000 rw-p 			[heap]
+ * 7fbd827fe000-7fbd8bdde000 rw-s 	/memfd:main (deleted)
+ * 7fbd8bdde000-7fbe82800000 ---s 	/memfd:main (deleted)
+ * 7fbe82800000-7fbe90670000 r--p 	/usr/lib/locale/locale-archive
+ * 7fbe90800000-7fbe90941000 r-xp 	/usr/lib64/libstdc++.so.6.0.34
+ * ...
+ *
+ * We need to place shared memory mappings in such a way that there are gaps
+ * between them in the address space. Those gaps have to be large enough to
+ * resize the mapping up to a certain size, without counting towards the total
+ * memory consumption.
+ *
+ * To achieve this, for each shared memory segment we first create an anonymous
+ * file of the specified size using memfd_create, which accommodates the actual
+ * shared memory mapping content. It is represented by the first /memfd:main
+ * with rw permissions. Then we create a mapping for this file using mmap, with
+ * a size much larger than required and the flags PROT_NONE (which ensures the
+ * reserved space is not used) and MAP_NORESERVE (which prevents the space from
+ * being counted against memory limits). The mapping serves as an address space
+ * reservation, into which the shared memory segment can be extended, and is
+ * represented by the second /memfd:main with no permissions.
+ *
+ * The reserved space for each segment is calculated as a fraction of the total
+ * reserved space (MaxAvailableMemory), as specified in the SHMEM_RESIZE_RATIO
+ * array.
+ */
+static double SHMEM_RESIZE_RATIO[1] = {
+	1.0, 									/* MAIN_SHMEM_SEGMENT */
+};
+
+/*
+ * Flag indicating that we have decided to use huge pages.
+ *
+ * XXX: It's possible to use GetConfigOption("huge_pages_status", false, false)
+ * instead, but it feels like overkill.
+ */
+static bool huge_pages_on = false;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -503,19 +548,20 @@ PGSharedMemoryAttach(IpcMemoryId shmId,
  * hugepage sizes, we might want to think about more invasive strategies,
  * such as increasing shared_buffers to absorb the extra space.
  *
- * Returns the (real, assumed or config provided) page size into
- * *hugepagesize, and the hugepage-related mmap flags to use into
- * *mmap_flags if requested by the caller.  If huge pages are not supported,
- * *hugepagesize and *mmap_flags are set to 0.
+ * Returns the (real, assumed or config provided) page size into *hugepagesize,
+ * the hugepage-related mmap and memfd flags to use into *mmap_flags and
+ * *memfd_flags if requested by the caller. If huge pages are not supported,
+ * *hugepagesize, *mmap_flags and *memfd_flags are set to 0.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 #ifdef MAP_HUGETLB
 
 	Size		default_hugepagesize = 0;
 	Size		hugepagesize_local = 0;
 	int			mmap_flags_local = 0;
+	int			memfd_flags_local = 0;
 
 	/*
 	 * System-dependent code to find out the default huge page size.
@@ -574,6 +620,7 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	}
 
 	mmap_flags_local = MAP_HUGETLB;
+	memfd_flags_local = MFD_HUGETLB;
 
 	/*
 	 * On recent enough Linux, also include the explicit page size, if
@@ -584,7 +631,16 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	{
 		int			shift = pg_ceil_log2_64(hugepagesize_local);
 
-		mmap_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+	}
+#endif
+
+#if defined(MFD_HUGE_MASK) && defined(MFD_HUGE_SHIFT)
+	if (hugepagesize_local != default_hugepagesize)
+	{
+		int			shift = pg_ceil_log2_64(hugepagesize_local);
+
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
 	}
 #endif
 
@@ -593,6 +649,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*mmap_flags = mmap_flags_local;
 	if (hugepagesize)
 		*hugepagesize = hugepagesize_local;
+	if (memfd_flags)
+		*memfd_flags = memfd_flags_local;
 
 #else
 
@@ -600,6 +658,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*hugepagesize = 0;
 	if (mmap_flags)
 		*mmap_flags = 0;
+	if (memfd_flags)
+		*memfd_flags = 0;
 
 #endif							/* MAP_HUGETLB */
 }
@@ -625,72 +685,90 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
  * Creates an anonymous mmap()ed shared memory segment.
  *
  * This function will modify mapping size to the actual size of the allocation,
- * if it ends up allocating a segment that is larger than requested.
+ * if it ends up allocating a segment that is larger than requested. If needed,
+ * it also rounds up the mapping's reserved size to be a multiple of the huge
+ * page size.
+ *
+ * Note that we do not fall back from huge pages to regular pages in this
+ * function; this decision was already made in ReserveAnonymousMemory and we
+ * stick to it.
  */
 static void
 CreateAnonymousSegment(AnonymousMapping *mapping)
 {
 	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
-	int			mmap_errno = 0;
+	int			save_errno = 0;
+	int			mmap_flags = PG_MMAP_FLAGS, memfd_flags = 0;
+
+	elog(DEBUG1, "segment[%s]: size %zu, reserved %zu",
+		 MappingName(mapping->shmem_segment), mapping->shmem_size,
+		 mapping->shmem_reserved);
 
 #ifndef MAP_HUGETLB
-	/* PGSharedMemoryCreate should have dealt with this case */
-	Assert(huge_pages != HUGE_PAGES_ON);
+	/* PrepareHugePages should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
 #else
-	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	if (huge_pages_on)
 	{
-		/*
-		 * Round up the request size to a suitable large value.
-		 */
 		Size		hugepagesize;
-		int			mmap_flags;
 
-		GetHugePageSize(&hugepagesize, &mmap_flags);
+		/* Make sure nothing is messed up */
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+		/* Round up the request size to a suitable large value */
+		GetHugePageSize(&hugepagesize, &mmap_flags, &memfd_flags);
 
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
-		mmap_errno = errno;
-		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-		{
-			DebugMappings();
-			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 MappingName(mapping->shmem_segment), allocsize);
-		}
+		/*
+		 * The reserved space is a multiple of BLCKSZ. Since we know the huge
+		 * page size, round the reserved space up to it.
+		 */
+		mapping->shmem_reserved = mapping->shmem_reserved + hugepagesize -
+			(mapping->shmem_reserved % hugepagesize);
+
+		/* Verify that the new size is within the reserved boundaries */
+		if (mapping->shmem_reserved < mapping->shmem_size)
+			ereport(ERROR,
+					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					 errmsg("not enough shared memory is reserved"),
+					 errhint("You may need to increase \"max_available_memory\".")));
+
+		mmap_flags = PG_MMAP_FLAGS | mmap_flags;
 	}
 #endif
 
 	/*
-	 * Report whether huge pages are in use.  This needs to be tracked before
-	 * the second mmap() call if attempting to use huge pages failed
-	 * previously.
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped, it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, thus it has
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
 	 */
-	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
-					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_segment),
+									   memfd_flags);
 
-	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
+	/*
+	 * Specify the segment file size using allocsize, which contains the
+	 * potentially rounded-up value.
+	 */
+	if (ftruncate(mapping->segment_fd, allocsize) == -1)
 	{
-		/*
-		 * Use the original size, not the rounded-up value, when falling back
-		 * to non-huge pages.
-		 */
-		allocsize = mapping->shmem_size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
-		mmap_errno = errno;
-	}
+		save_errno = errno;
 
-	if (ptr == MAP_FAILED)
-	{
-		errno = mmap_errno;
 		DebugMappings();
+		close(mapping->segment_fd);
+
+		errno = save_errno;
 		ereport(FATAL,
-				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+				(errmsg("segment[%s]: could not truncate anonymous file: %m",
 						MappingName(mapping->shmem_segment)),
-				 (mmap_errno == ENOMEM) ?
+				 (save_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
 						 "swap space, or huge pages. To reduce the request size "
@@ -700,10 +778,112 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 						 allocsize) : 0));
 	}
 
+	elog(DEBUG1, "segment[%s]: mmap(%zu)",
+		 MappingName(mapping->shmem_segment), allocsize);
+
+	/*
+	 * Create a reservation mapping.
+	 */
+	ptr = mmap(NULL, mapping->shmem_reserved, PROT_NONE,
+			   mmap_flags | MAP_NORESERVE, mapping->segment_fd, 0);
+	save_errno = errno;
+
+	if (ptr == MAP_FAILED)
+	{
+		DebugMappings();
+
+		errno = save_errno;
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment))));
+	}
+
+	/* Make the memory accessible */
+	if (mprotect(ptr, allocsize, PROT_READ | PROT_WRITE) == -1)
+	{
+		save_errno = errno;
+		DebugMappings();
+
+		errno = save_errno;
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not mprotect anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment))));
+	}
+
 	mapping->shmem = ptr;
 	mapping->shmem_size = allocsize;
 }
 
+/*
+ * PrepareHugePages
+ *
+ * Figure out if there are enough huge pages to allocate all shared memory
+ * segments, and report that information via huge_pages_status and
+ * huge_pages_on. It needs to be called before creating shared memory segments.
+ *
+ * It is necessary to maintain the same semantics (simple on/off) for
+ * huge_pages_status, even if there are multiple shared memory segments: all
+ * segments either use huge pages or not, there is no mix of segments with
+ * different page sizes. The latter might actually be beneficial, in particular
+ * because only some segments may require a large amount of memory, but for now
+ * we go with a simple solution.
+ */
+void
+PrepareHugePages()
+{
+	void	   *ptr = MAP_FAILED;
+
+	/* Reset to handle reinitialization */
+	next_free_segment = 0;
+
+	/* Complain if hugepages demanded but we can't possibly support them */
+#if !defined(MAP_HUGETLB)
+	if (huge_pages == HUGE_PAGES_ON)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("huge pages not supported on this platform")));
+#else
+	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	{
+		Size		hugepagesize, total_size = 0;
+		int			mmap_flags;
+
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+
+		/*
+		 * Figure out how much memory is needed for all segments, keeping in
+		 * mind that for every segment this value will be rounded up to the
+		 * huge page size. The resulting value will be used to probe memory and
+		 * decide whether we will allocate huge pages or not.
+		 */
+		for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+		{
+			int	numSemas;
+			Size segment_size = CalculateShmemSize(&numSemas, segment);
+
+			if (segment_size % hugepagesize != 0)
+				segment_size += hugepagesize - (segment_size % hugepagesize);
+
+			total_size += segment_size;
+		}
+
+		/* Map total amount of memory to test its availability. */
+		elog(DEBUG1, "reserving space: probe mmap(%zu) with MAP_HUGETLB",
+					 total_size);
+		ptr = mmap(NULL, total_size, PROT_NONE,
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
+	}
+#endif
+
+	/*
+	 * Report whether huge pages are in use. This needs to be tracked before
+	 * creating shared memory segments.
+	 */
+	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
+					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	huge_pages_on = ptr != MAP_FAILED;
+}
+
 /*
  * AnonymousShmemDetach --- detach from an anonymous mmap'd block
  * (called as an on_shmem_exit callback, hence funny argument list)
@@ -746,7 +926,7 @@ PGSharedMemoryCreate(Size size,
 	void	   *memAddress;
 	PGShmemHeader *hdr;
 	struct stat statbuf;
-	Size		sysvsize;
+	Size		sysvsize, total_reserved;
 	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
@@ -760,14 +940,6 @@ PGSharedMemoryCreate(Size size,
 				 errmsg("could not stat data directory \"%s\": %m",
 						DataDir)));
 
-	/* Complain if hugepages demanded but we can't possibly support them */
-#if !defined(MAP_HUGETLB)
-	if (huge_pages == HUGE_PAGES_ON)
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("huge pages not supported on this platform")));
-#endif
-
 	/* For now, we don't support huge pages in SysV memory */
 	if (huge_pages == HUGE_PAGES_ON && shared_memory_type != SHMEM_TYPE_MMAP)
 		ereport(ERROR,
@@ -776,8 +948,16 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+
+	/* Prepare the mapping information */
 	mapping->shmem_size = size;
 	mapping->shmem_segment = next_free_segment;
+	total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
+	mapping->shmem_reserved = total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
+
+	/* Round up to be a multiple of BLCKSZ */
+	mapping->shmem_reserved = mapping->shmem_reserved + BLCKSZ -
+		(mapping->shmem_reserved % BLCKSZ);
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 4dee856d6bd..732fedee87e 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -627,7 +627,7 @@ pgwin32_ReserveSharedMemoryRegion(HANDLE hChild)
  * use GetLargePageMinimum() instead.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 	if (hugepagesize)
 		*hugepagesize = 0;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8b38e985327..b60f7ef9ce2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -206,6 +206,9 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
+	/* Decide if we use huge pages or regular size pages */
+	PrepareHugePages();
+
 	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
 	{
 		/* Compute the size of the shared-memory block */
@@ -377,7 +380,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the number of huge pages required.
 	 */
-	GetHugePageSize(&hp_size, NULL);
+	GetHugePageSize(&hp_size, NULL, NULL);
 	if (hp_size != 0)
 	{
 		Size		hp_required;
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index f185ed28f95..9bb73f31052 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -815,7 +815,7 @@ pg_get_shmem_pagesize(void)
 	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
 
 	if (huge_pages_status == HUGE_PAGES_ON)
-		GetHugePageSize(&os_page_size, NULL);
+		GetHugePageSize(&os_page_size, NULL, NULL);
 
 	return os_page_size;
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..90d3feb547c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -140,6 +140,7 @@ int			max_parallel_maintenance_workers = 2;
  * register background workers.
  */
 int			NBuffers = 16384;
+int			MaxAvailableMemory = 524288;
 int			MaxConnections = 100;
 int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 6bc6be13d2a..c94f3fc3c80 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1107,6 +1107,18 @@
   max => 'INT_MAX / 2',
 },
 
+# TODO: should this be PGC_POSTMASTER?
+{ name => "max_available_memory", type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_MEM',
+  short_desc => 'Sets the upper limit for the shared_buffers value.',
+  long_desc => 'Shared memory can be resized at runtime; this parameter sets the upper limit for it, beyond which resizing is not supported. Normally this value would be the same as the total available memory.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'MaxAvailableMemory',
+  boot_val => '524288',
+  min => '16',
+  max => 'INT_MAX / 2',
+},
+
+
 { name => 'vacuum_buffer_usage_limit', type => 'int', context => 'PGC_USERSET', group => 'RESOURCES_MEM',
   short_desc => 'Sets the buffer pool size for VACUUM, ANALYZE, and autovacuum.',
   flags => 'GUC_UNIT_KB',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..a0c37a7749e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -173,6 +173,7 @@ extern PGDLLIMPORT char *DataDir;
 extern PGDLLIMPORT int data_directory_mode;
 
 extern PGDLLIMPORT int NBuffers;
+extern PGDLLIMPORT int MaxAvailableMemory;
 extern PGDLLIMPORT int MaxBackends;
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index ef9800732d9..40588ff6968 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 2348c59b5a0..79b0b1ef9eb 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
 extern PGDLLIMPORT int huge_pages_status;
+extern PGDLLIMPORT int MaxAvailableMemory;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
@@ -104,7 +105,9 @@ extern PGShmemHeader *PGSharedMemoryCreate(Size size,
 										   PGShmemHeader **shim);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
-extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
+extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
+							int *memfd_flags);
+void PrepareHugePages(void);
 
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
-- 
2.34.1

0008-Fix-compilation-failures-from-previous-comm-20250918.patchapplication/x-patch; name=0008-Fix-compilation-failures-from-previous-comm-20250918.patchDownload
From 86ada56e8c48d8111b40d10cae8c96a3286d210a Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Wed, 20 Aug 2025 11:35:20 +0530
Subject: [PATCH 08/16] Fix compilation failures from previous commits

shm_total_page_count is used uninitialized. If this variable has a random
value to start with, the final sum will be wrong.

Also include pg_shmem.h where shared memory segment macros are used.

Author: Ashutosh Bapat
---
 src/backend/storage/buffer/buf_init.c | 1 +
 src/backend/storage/ipc/shmem.c       | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 5383442e213..6d703e18f8b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -16,6 +16,7 @@
 
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
+#include "storage/pg_shmem.h"
 #include "storage/bufmgr.h"
 
 BufferDescPadded *BufferDescriptors;
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 9bb73f31052..e6cb919f0fc 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -649,7 +649,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 	Size		os_page_size;
 	void	  **page_ptrs;
 	int		   *pages_status;
-	uint64		shm_total_page_count,
+	uint64		shm_total_page_count = 0,
 				shm_ent_page_count,
 				max_nodes;
 	Size	   *nodes;
-- 
2.34.1

0007-Introduce-multiple-shmem-segments-for-share-20250918.patchapplication/x-patch; name=0007-Introduce-multiple-shmem-segments-for-share-20250918.patchDownload
From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:22:02 +0200
Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into the following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers,
meaning that if we want to change NBuffers, these structures have to be
resized correspondingly. Placing each of them in a separate shmem
segment makes that possible.

There are some assumptions made about each shmem segment's upper size
limit. The buffer blocks have the largest, while the rest claim less
extra room for resizing. Ideally those limits should be deduced from the
maximum allowed shared memory.
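
As a purely illustrative example of how the reservation is split, using the
defaults from this series (MaxAvailableMemory = 524288 blocks, BLCKSZ = 8192)
and the SHMEM_RESIZE_RATIO values below:

    total reserved           = 524288 * 8192 bytes = 4 GB
    buffers     (ratio 0.60) = ~2.4 GB
    main        (ratio 0.10) = ~410 MB
    descriptors (ratio 0.10) = ~410 MB
    iocv        (ratio 0.10) = ~410 MB
    checkpoint  (ratio 0.05) = ~205 MB
    strategy    (ratio 0.05) = ~205 MB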
---
 src/backend/port/sysv_shmem.c          | 24 +++++++-
 src/backend/storage/buffer/buf_init.c  | 79 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  6 +-
 src/backend/storage/buffer/freelist.c  |  5 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 24 +++++++-
 7 files changed, 105 insertions(+), 37 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 363ddfd1fca..dac011b766b 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -139,10 +139,18 @@ static int next_free_segment = 0;
  *
  * The reserved space for each segment is calculated as a fraction of the total
  * reserved space (MaxAvailableMemory), as specified in the SHMEM_RESIZE_RATIO
- * array.
+ * array. E.g. we allow BUFFERS_SHMEM_SEGMENT to take up to 60% of the whole
+ * space when resizing, based on the fact that it will most likely be the main
+ * consumer of this memory. Those numbers are pulled out of thin air for now;
+ * it makes sense to evaluate them more precisely.
  */
-static double SHMEM_RESIZE_RATIO[1] = {
-	1.0, 									/* MAIN_SHMEM_SEGMENT */
+static double SHMEM_RESIZE_RATIO[6] = {
+	0.1,    /* MAIN_SHMEM_SEGMENT */
+	0.6,    /* BUFFERS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_IOCV_SHMEM_SEGMENT */
+	0.05,   /* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
+	0.05,   /* STRATEGY_SHMEM_SEGMENT */
 };
 
 /*
@@ -167,6 +175,16 @@ MappingName(int shmem_segment)
 	{
 		case MAIN_SHMEM_SEGMENT:
 			return "main";
+		case BUFFERS_SHMEM_SEGMENT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SEGMENT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SEGMENT:
+			return "strategy";
 		default:
 			return "unknown";
 	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6fd3a6bbac5..5383442e213 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -62,7 +62,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). The size of the data structures
+ * initialized here depends on NBuffers, and to be able to change NBuffers
+ * without a restart we store each structure in a separate shared memory
+ * segment, which can be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -74,22 +77,22 @@ BufferManagerShmemInit(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSegment("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSegment("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -99,8 +102,9 @@ BufferManagerShmemInit(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -147,33 +151,54 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc., based on the
+ * shared memory segment. The main segment must not allocate anything
+ * related to buffers; every other segment receives its part of the
+ * data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(int shmem_segment)
 {
 	Size		size = 0;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == MAIN_SHMEM_SEGMENT)
+		return size;
 
-	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
+	{
+		/* size of buffer descriptors */
+		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+		/* to allow aligning buffer descriptors */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of data pages, plus alignment padding */
+		size = add_size(size, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	}
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
-								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
+	{
+		/* size of stuff controlled by freelist.c */
+		size = add_size(size, StrategyShmemSize());
+	}
 
-	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
+	{
+		/* size of I/O condition variables */
+		size = add_size(size, mul_size(NBuffers,
+									   sizeof(ConditionVariableMinimallyPadded)));
+		/* to allow aligning the above */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
+
+	if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of checkpoint sort array in bufmgr.c */
+		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	}
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 1f6e215a2ca..18a78967138 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -25,6 +25,7 @@
 #include "funcapi.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
+#include "storage/pg_shmem.h"
 #include "utils/rel.h"
 #include "utils/builtins.h"
 
@@ -64,10 +65,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION | HASH_FIXED_SIZE);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION | HASH_FIXED_SIZE,
+								  STRATEGY_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d59a92bd1a..0bfbbb096d6 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 
 #define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
@@ -381,9 +382,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SEGMENT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b60f7ef9ce2..2dbd81afc87 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -113,7 +113,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize());
+	size = add_size(size, BufferManagerShmemSize(shmem_segment));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 47360a3d3d8..f8d34513c7f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,7 +318,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 79b0b1ef9eb..a7b275b4db9 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -109,7 +109,29 @@ extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 							int *memfd_flags);
 void PrepareHugePages(void);
 
+/*
+ * To be able to dynamically resize the largest parts of the data stored in
+ * shared memory, we split it into multiple shared memory segments. Each
+ * segment contains only a certain part of the data, whose size depends on
+ * NBuffers.
+ */
+
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SEGMENT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SEGMENT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SEGMENT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SEGMENT 5
+
 #endif							/* PG_SHMEM_H */
-- 
2.34.1

0009-Refactor-CalculateShmemSize-20250918.patchapplication/x-patch; name=0009-Refactor-CalculateShmemSize-20250918.patchDownload
From 2d96fa0ede7381c573ad608d89a90f2a960ceca3 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 21 Aug 2025 11:56:09 +0530
Subject: [PATCH 09/16] Refactor CalculateShmemSize()

This function calls many functions which return the amount of shared
memory required for different shared memory data structures. Up until
now, the returned total of these sizes was used to create a single
shared memory segment. But starting with the previous patch, we create
multiple shared memory segments, each of which contains one shared
memory structure related to shared buffers, plus one main memory
segment containing the rest of the structures. Since CalculateShmemSize()
is called for every shared memory segment, and its return value is added
to the memory required for all the shared memory segments, we end up
allocating more memory than required.

Instead, CalculateShmemSize() is called only once. Each of its callees
is expected to (a) return the size required from the main segment and
(b) add sizes to the AnonymousMappings corresponding to the other memory
segments.

For individual modules to add memory to their respective
AnonymousMappings, we need to know the different mappings upfront. Hence
ANON_MAPPINGS replaces next_free_segment.

TODOs:

1. This change however requires that the AnonymousMappings array and
   macros defining the identifiers of each of the segments be
platform-independent. This patch doesn't achieve that goal for all
platforms, for example Windows. We need to fix that.

2. If postgres is invoked with -C shared_memory_size, it reports 0.
   That's because it reports the GUC values before shared memory sizes
are set in AnonymousMappings. Fix that too.

3. Eliminate this asymmetry in CalculateShmemSize(). See the TODO in
   the prologue of CalculateShmemSize().

4. This is one way to avoid requesting more memory in each segment. But
   there may be other ways to design CalculateShmemSize(). We need to
think it through and implement it better.

Author: Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c         | 48 ++++++--------------
 src/backend/port/win32_shmem.c        |  7 +--
 src/backend/postmaster/postmaster.c   | 14 +++---
 src/backend/storage/buffer/buf_init.c | 55 ++++++++---------------
 src/backend/storage/ipc/ipci.c        | 65 ++++++++++++++++++++++-----
 src/backend/storage/ipc/shmem.c       |  8 ++--
 src/backend/tcop/postgres.c           | 14 +++---
 src/include/storage/bufmgr.h          |  2 +-
 src/include/storage/ipc.h             |  2 +-
 src/include/storage/pg_shmem.h        | 17 ++++++-
 10 files changed, 125 insertions(+), 107 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index dac011b766b..b85911bdfc4 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,21 +94,7 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-typedef struct AnonymousMapping
-{
-	int shmem_segment;
-	Size shmem_size; 			/* Size of the actually used memory */
-	Size shmem_reserved; 		/* Size of the reserved mapping */
-	Pointer shmem; 				/* Pointer to the start of the mapped memory */
-	Pointer seg_addr; 			/* SysV shared memory for the header */
-	unsigned long seg_id; 		/* IPC key */
-	int segment_fd; 			/* fd for the backing anon file */
-} AnonymousMapping;
-
-static AnonymousMapping Mappings[ANON_MAPPINGS];
-
-/* Keeps track of used mapping segments */
-static int next_free_segment = 0;
+AnonymousMapping Mappings[ANON_MAPPINGS];
 
 /*
  * Anonymous mapping layout we use looks like this:
@@ -168,7 +154,7 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
-static const char*
+const char*
 MappingName(int shmem_segment)
 {
 	switch (shmem_segment)
@@ -193,7 +179,7 @@ MappingName(int shmem_segment)
 static void
 DebugMappings()
 {
-	for(int i = 0; i < next_free_segment; i++)
+	for(int i = 0; i < ANON_MAPPINGS; i++)
 	{
 		AnonymousMapping m = Mappings[i];
 		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
@@ -851,9 +837,6 @@ PrepareHugePages()
 {
 	void	   *ptr = MAP_FAILED;
 
-	/* Reset to handle reinitialization */
-	next_free_segment = 0;
-
 	/* Complain if hugepages demanded but we can't possibly support them */
 #if !defined(MAP_HUGETLB)
 	if (huge_pages == HUGE_PAGES_ON)
@@ -876,8 +859,7 @@ PrepareHugePages()
 		 */
 		for(int segment = 0; segment < ANON_MAPPINGS; segment++)
 		{
-			int	numSemas;
-			Size segment_size = CalculateShmemSize(&numSemas, segment);
+			Size segment_size = Mappings[segment].shmem_req_size;
 
 			if (segment_size % hugepagesize != 0)
 				segment_size += hugepagesize - (segment_size % hugepagesize);
@@ -909,7 +891,7 @@ PrepareHugePages()
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	for(int i = 0; i < next_free_segment; i++)
+	for(int i = 0; i < ANON_MAPPINGS; i++)
 	{
 		AnonymousMapping m = Mappings[i];
 
@@ -927,7 +909,7 @@ AnonymousShmemDetach(int status, Datum arg)
 /*
  * PGSharedMemoryCreate
  *
- * Create a shared memory segment of the given size and initialize its
+ * Create a shared memory segment for the given mapping and initialize its
  * standard header.  Also, register an on_shmem_exit callback to release
  * the storage.
  *
@@ -937,7 +919,7 @@ AnonymousShmemDetach(int status, Datum arg)
  * postmaster or backend.
  */
 PGShmemHeader *
-PGSharedMemoryCreate(Size size,
+PGSharedMemoryCreate(AnonymousMapping *mapping,
 					 PGShmemHeader **shim)
 {
 	IpcMemoryKey NextShmemSegID;
@@ -945,7 +927,6 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize, total_reserved;
-	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -965,13 +946,12 @@ PGSharedMemoryCreate(Size size,
 				 errmsg("huge pages not supported with the current \"shared_memory_type\" setting")));
 
 	/* Room for a header? */
-	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	Assert(mapping->shmem_req_size > MAXALIGN(sizeof(PGShmemHeader)));
 
 	/* Prepare the mapping information */
-	mapping->shmem_size = size;
-	mapping->shmem_segment = next_free_segment;
+	mapping->shmem_size = mapping->shmem_req_size;
 	total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
-	mapping->shmem_reserved = total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
+	mapping->shmem_reserved = total_reserved * SHMEM_RESIZE_RATIO[mapping->shmem_segment];
 
 	/* Round up to be a multiple of BLCKSZ */
 	mapping->shmem_reserved = mapping->shmem_reserved + BLCKSZ -
@@ -982,8 +962,6 @@ PGSharedMemoryCreate(Size size,
 		/* On success, mapping data will be modified. */
 		CreateAnonymousSegment(mapping);
 
-		next_free_segment++;
-
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
 
@@ -992,7 +970,7 @@ PGSharedMemoryCreate(Size size,
 	}
 	else
 	{
-		sysvsize = size;
+		sysvsize = mapping->shmem_req_size;
 
 		/* huge pages are only available with mmap */
 		SetConfigOption("huge_pages_status", "off",
@@ -1005,7 +983,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino + next_free_segment;
+	NextShmemSegID = statbuf.st_ino + mapping->shmem_segment;
 
 	for (;;)
 	{
@@ -1214,7 +1192,7 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	for(int i = 0; i < next_free_segment; i++)
+	for(int i = 0; i < ANON_MAPPINGS; i++)
 	{
 		AnonymousMapping m = Mappings[i];
 
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 732fedee87e..1db07ff65d3 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -204,7 +204,7 @@ EnableLockPagesPrivilege(int elevel)
  * standard header.
  */
 PGShmemHeader *
-PGSharedMemoryCreate(Size size,
+PGSharedMemoryCreate(AnonymousMapping *mapping,
 					 PGShmemHeader **shim)
 {
 	void	   *memAddress;
@@ -216,7 +216,7 @@ PGSharedMemoryCreate(Size size,
 	DWORD		size_high;
 	DWORD		size_low;
 	SIZE_T		largePageSize = 0;
-	Size		orig_size = size;
+	Size		size = mapping->shmem_req_size;
 	DWORD		flProtect = PAGE_READWRITE;
 	DWORD		desiredAccess;
 
@@ -304,7 +304,7 @@ retry:
 				 * Use the original size, not the rounded-up value, when
 				 * falling back to non-huge pages.
 				 */
-				size = orig_size;
+				size = mapping->shmem_req_size;
 				flProtect = PAGE_READWRITE;
 				goto retry;
 			}
@@ -391,6 +391,7 @@ retry:
 	hdr->totalsize = size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	hdr->dsm_control = 0;
+	mapping->shmem_size = size;
 
 	/* Save info for possible future use */
 	UsedShmemSegAddr = memAddress;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e1d643b013d..b59d20b4ac2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -963,13 +963,6 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	process_shmem_requests();
 
-	/*
-	 * Now that loadable modules have had their chance to request additional
-	 * shared memory, determine the value of any runtime-computed GUCs that
-	 * depend on the amount of shared memory required.
-	 */
-	InitializeShmemGUCs();
-
 	/*
 	 * Now that modules have been loaded, we can process any custom resource
 	 * managers specified in the wal_consistency_checking GUC.
@@ -1005,6 +998,13 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	CreateSharedMemoryAndSemaphores();
 
+	/*
+	 * Now that loadable modules have had their chance to request additional
+	 * shared memory, determine the value of any runtime-computed GUCs that
+	 * depend on the amount of shared memory required.
+	 */
+	InitializeShmemGUCs();
+
 	/*
 	 * Estimate number of openable files.  This must happen after setting up
 	 * semaphores, because on some platforms semaphores count as open files.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6d703e18f8b..6f148d1d80b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -158,48 +158,31 @@ BufferManagerShmemInit(void)
  * data.
  */
 Size
-BufferManagerShmemSize(int shmem_segment)
+BufferManagerShmemSize(void)
 {
-	Size		size = 0;
+	size_t size;
 
-	if (shmem_segment == MAIN_SHMEM_SEGMENT)
-		return size;
+	/* size of buffer descriptors, plus alignment padding */
+	size = add_size(0, mul_size(NBuffers, sizeof(BufferDescPadded)));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	Mappings[BUFFER_DESCRIPTORS_SHMEM_SEGMENT].shmem_req_size = size;
 
-	if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
-	{
-		/* size of buffer descriptors */
-		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-		/* to allow aligning buffer descriptors */
-		size = add_size(size, PG_CACHE_LINE_SIZE);
-	}
+	/* size of data pages, plus alignment padding */
+	size = add_size(0, PG_IO_ALIGN_SIZE);
+	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	Mappings[BUFFERS_SHMEM_SEGMENT].shmem_req_size = size;
 
-	if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
-	{
-		/* size of data pages, plus alignment padding */
-		size = add_size(size, PG_IO_ALIGN_SIZE);
-		size = add_size(size, mul_size(NBuffers, BLCKSZ));
-	}
+	/* size of stuff controlled by freelist.c */
+	Mappings[STRATEGY_SHMEM_SEGMENT].shmem_req_size = StrategyShmemSize();
 
-	if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
-	{
-		/* size of stuff controlled by freelist.c */
-		size = add_size(size, StrategyShmemSize());
-	}
+	/* size of I/O condition variables, plus alignment padding */
+	size = add_size(0, mul_size(NBuffers,
+								   sizeof(ConditionVariableMinimallyPadded)));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	Mappings[BUFFER_IOCV_SHMEM_SEGMENT].shmem_req_size = size;
 
-	if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
-	{
-		/* size of I/O condition variables */
-		size = add_size(size, mul_size(NBuffers,
-									   sizeof(ConditionVariableMinimallyPadded)));
-		/* to allow aligning the above */
-		size = add_size(size, PG_CACHE_LINE_SIZE);
-	}
-
-	if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
-	{
-		/* size of checkpoint sort array in bufmgr.c */
-		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
-	}
+	/* size of checkpoint sort array in bufmgr.c */
+	Mappings[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_req_size = mul_size(NBuffers, sizeof(CkptSortItem));
 
 	return size;
 }
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2dbd81afc87..2cd278449f0 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -84,9 +84,23 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ * 
+ * TODO: Right now the callees of this function return the size of shared
+ * memory required in the main shared memory segment, but add the sizes
+ * required from other segments to the respective mappings. I think we should
+ * change this asymmetry. It's only the buffer manager which adds sizes for
+ * other segments, but in future there may be others. Further, the other
+ * buffer-manager-related segments are expected to hold only one resizable
+ * structure each, thus their sizes should be set only once, when changing the
+ * shared buffer pool size (i.e. when changing the shared_buffers GUC). We
+ * shouldn't allow adding more structures to these segments, and thus should
+ * restrict adding sizes to the corresponding mappings after the initial size is set.
+ * 
+ * TODO: Also we should do something about numSemas, which is not required
+ * everywhere CalculateShmemSize is called.
  */
 Size
-CalculateShmemSize(int *num_semaphores, int shmem_segment)
+CalculateShmemSize(int *num_semaphores)
 {
 	Size		size;
 	int			numSemas;
@@ -113,7 +127,13 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize(shmem_segment));
+
+	/*
+	 * The buffer manager adds estimates of the memory required for every
+	 * shared memory segment that it uses to the corresponding AnonymousMappings.
+	 * Consider only the size required from the main shared memory segment here.
+	 */
+	size = add_size(size, BufferManagerShmemSize());
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
@@ -154,8 +174,15 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
+	/*
+	 * All the shared memory allocations considered so far happen in the main
+	 * shared memory segment.
+	 */
+	Mappings[MAIN_SHMEM_SEGMENT].shmem_req_size = size;
+
 	/* might as well round it off to a multiple of a typical page size */
-	size = add_size(size, 8192 - (size % 8192));
+	for (int segment = 0; segment < ANON_MAPPINGS; segment++)
+		Mappings[segment].shmem_req_size = add_size(Mappings[segment].shmem_req_size, 8192 - (Mappings[segment].shmem_req_size % 8192));
 
 	return size;
 }
@@ -201,26 +228,30 @@ CreateSharedMemoryAndSemaphores(void)
 {
 	PGShmemHeader *shim;
 	PGShmemHeader *seghdr;
-	Size		size;
 	int			numSemas;
 
 	Assert(!IsUnderPostmaster);
 
+	CalculateShmemSize(&numSemas);
+
 	/* Decide if we use huge pages or regular size pages */
 	PrepareHugePages();
 
 	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
 	{
+		AnonymousMapping *mapping = &Mappings[segment];
+
+		mapping->shmem_segment = segment;
+
 		/* Compute the size of the shared-memory block */
-		size = CalculateShmemSize(&numSemas, segment);
-		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", mapping->shmem_req_size);
 
 		/*
 		 * Create the shmem segment.
 		 *
 		 * XXX: Do multiple shims are needed, one per segment?
 		 */
-		seghdr = PGSharedMemoryCreate(size, &shim);
+		seghdr = PGSharedMemoryCreate(mapping, &shim);
 
 		/*
 		 * Make sure that huge pages are never reported as "unknown" while the
@@ -232,9 +263,13 @@ CreateSharedMemoryAndSemaphores(void)
 		InitShmemAccessInSegment(seghdr, segment);
 
 		/*
-		 * Create semaphores
+		 * Shared memory for semaphores is allocated in the main shared memory
+		 * segment, hence they are allocated after the main segment is created.
+		 * The patch proposed at https://commitfest.postgresql.org/patch/5997/
+		 * simplifies this.
 		 */
-		PGReserveSemaphores(numSemas, segment);
+		if (segment == MAIN_SHMEM_SEGMENT)
+			PGReserveSemaphores(numSemas, segment);
 
 		/*
 		 * Set up shared memory allocation mechanism
@@ -357,7 +392,9 @@ CreateOrAttachShmemStructs(void)
  * InitializeShmemGUCs
  *
  * This function initializes runtime-computed GUCs related to the amount of
- * shared memory required for the current configuration.
+ * shared memory required for the current configuration. It assumes that the
+ * memory required by the shared memory segments is already calculated and is
+ * available in AnonymousMappings.
  */
 void
 InitializeShmemGUCs(void)
@@ -366,12 +403,16 @@ InitializeShmemGUCs(void)
 	Size		size_b;
 	Size		size_mb;
 	Size		hp_size;
-	int			num_semas;
+	int			num_semas = ProcGlobalSemas();
+	int		i;
 
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
+	size_b = 0;
+	for (i = 0; i < ANON_MAPPINGS; i++)
+		size_b = add_size(size_b, Mappings[i].shmem_req_size);
+
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index e6cb919f0fc..90c21a97225 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -178,8 +178,8 @@ ShmemAllocInSegment(Size size, int shmem_segment)
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("out of shared memory (%zu bytes requested)",
-						size)));
+				 errmsg("out of shared memory in segment %s (%zu bytes requested)",
+					MappingName(shmem_segment), size)));
 	return newSpace;
 }
 
@@ -286,8 +286,8 @@ ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("out of shared memory (%zu bytes requested)",
-						size)));
+				 errmsg("out of shared memory in segment %s (%zu bytes requested)",
+						MappingName(shmem_segment), size)));
 	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
 	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8d4d6cc3f33..c819608fff6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4132,13 +4132,6 @@ PostgresSingleUserMain(int argc, char *argv[],
 	 */
 	process_shmem_requests();
 
-	/*
-	 * Now that loadable modules have had their chance to request additional
-	 * shared memory, determine the value of any runtime-computed GUCs that
-	 * depend on the amount of shared memory required.
-	 */
-	InitializeShmemGUCs();
-
 	/*
 	 * Now that modules have been loaded, we can process any custom resource
 	 * managers specified in the wal_consistency_checking GUC.
@@ -4151,6 +4144,13 @@ PostgresSingleUserMain(int argc, char *argv[],
 	 */
 	CreateSharedMemoryAndSemaphores();
 
+	/*
+	 * Now that loadable modules have had their chance to request additional
+	 * shared memory, determine the value of any runtime-computed GUCs that
+	 * depend on the amount of shared memory required.
+	 */
+	InitializeShmemGUCs();
+
 	/*
 	 * Estimate number of openable files.  This must happen after setting up
 	 * semaphores, because on some platforms semaphores count as open files.
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f8d34513c7f..47360a3d3d8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,7 +318,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(int);
+extern Size BufferManagerShmemSize(void);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 6ebda479ced..3baf418b3d1 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
+extern Size CalculateShmemSize(int *num_semaphores);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index a7b275b4db9..a1fa6b43fe3 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -27,6 +27,18 @@
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
+typedef struct AnonymousMapping
+{
+	int shmem_segment;			/* TODO: Do we really need it? */
+	Size shmem_req_size;		/* Required size of the segment */
+	Size shmem_size; 			/* Size of the actually used memory */
+	Size shmem_reserved; 		/* Size of the reserved mapping */
+	Pointer shmem; 				/* Pointer to the start of the mapped memory */
+	Pointer seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
+} AnonymousMapping;
+
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
 	int32		magic;			/* magic # to identify Postgres segments */
@@ -55,6 +67,8 @@ typedef struct ShmemSegment
 #define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
+
 
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
@@ -101,10 +115,11 @@ extern void PGSharedMemoryReAttach(void);
 extern void PGSharedMemoryNoReAttach(void);
 #endif
 
-extern PGShmemHeader *PGSharedMemoryCreate(Size size,
+extern PGShmemHeader *PGSharedMemoryCreate(AnonymousMapping *mapping,
 										   PGShmemHeader **shim);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
+extern const char *MappingName(int shmem_segment);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 							int *memfd_flags);
 void PrepareHugePages(void);
-- 
2.34.1

0010-WIP-Monitoring-views-20250918.patchapplication/x-patch; name=0010-WIP-Monitoring-views-20250918.patchDownload
From 86fb36cab8e2079fde361380ae42fe4dbabe7967 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Wed, 20 Aug 2025 10:55:27 +0530
Subject: [PATCH 10/16] WIP: Monitoring views

Modifies pg_shmem_allocations to report shared memory segment as well.

Adds pg_shmem_segments to report shared memory segment information.

TODO:
This commit should be merged with the earlier commit introducing
multiple shared memory segments.

Author: Ashutosh Bapat
---
 doc/src/sgml/system-views.sgml       |  9 +++
 src/backend/catalog/system_views.sql |  7 +++
 src/backend/storage/ipc/shmem.c      | 90 ++++++++++++++++++++++------
 src/include/catalog/pg_proc.dat      | 12 +++-
 src/include/storage/pg_shmem.h       |  1 -
 src/include/storage/shmem.h          |  1 +
 src/test/regress/expected/rules.out  | 10 +++-
 7 files changed, 108 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 89be9bc333f..7d14a6eca24 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -4167,6 +4167,15 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>segment</structfield> <type>text</type>
+      </para>
+      <para>
+       The name of the shared memory segment in which the allocation was made.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>off</structfield> <type>int8</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 46fc28396de..f659dbb2f86 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,13 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
 REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
 GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
 
+CREATE VIEW pg_shmem_segments AS
+    SELECT * FROM pg_get_shmem_segments();
+
+REVOKE ALL ON pg_shmem_segments FROM PUBLIC;
+GRANT SELECT ON pg_shmem_segments TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_segments() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_segments() TO pg_read_all_stats;
 CREATE VIEW pg_shmem_allocations_numa AS
     SELECT * FROM pg_get_shmem_allocations_numa();
 
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 90c21a97225..9499f332e77 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -531,6 +531,7 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 		result->size = size;
 		result->allocated_size = allocated_size;
 		result->location = structPtr;
+		result->shmem_segment = shmem_segment;
 	}
 
 	LWLockRelease(ShmemIndexLock);
@@ -582,13 +583,14 @@ mul_size(Size s1, Size s2)
 Datum
 pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 {
-#define PG_GET_SHMEM_SIZES_COLS 4
+#define PG_GET_SHMEM_SIZES_COLS 5
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	HASH_SEQ_STATUS hstat;
 	ShmemIndexEnt *ent;
-	Size		named_allocated = 0;
+	Size		named_allocated[ANON_MAPPINGS] = {0};
 	Datum		values[PG_GET_SHMEM_SIZES_COLS];
 	bool		nulls[PG_GET_SHMEM_SIZES_COLS];
+	int			i;
 
 	InitMaterializedSRF(fcinfo, 0);
 
@@ -598,33 +600,42 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
-	/* XXX: take all shared memory segments into account. */
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
-		values[2] = Int64GetDatum(ent->size);
-		values[3] = Int64GetDatum(ent->allocated_size);
-		named_allocated += ent->allocated_size;
+		values[1] = CStringGetTextDatum(MappingName(ent->shmem_segment));
+		values[2] = Int64GetDatum((char *) ent->location - (char *) Segments[ent->shmem_segment].ShmemSegHdr);
+		values[3] = Int64GetDatum(ent->size);
+		values[4] = Int64GetDatum(ent->allocated_size);
+		named_allocated[ent->shmem_segment] += ent->allocated_size;
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
 							 values, nulls);
 	}
 
 	/* output shared memory allocated but not counted via the shmem index */
-	values[0] = CStringGetTextDatum("<anonymous>");
-	nulls[1] = true;
-	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
-	values[3] = values[2];
-	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	for (i = 0; i < ANON_MAPPINGS; i++)
+	{
+		values[0] = CStringGetTextDatum("<anonymous>");
+		values[1] = CStringGetTextDatum(MappingName(i));
+		nulls[2] = true;
+		values[3] = Int64GetDatum(Segments[i].ShmemSegHdr->freeoffset - named_allocated[i]);
+		values[4] = values[3];
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
 
 	/* output as-of-yet unused shared memory */
-	nulls[0] = true;
-	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
-	nulls[1] = false;
-	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
-	values[3] = values[2];
-	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	memset(nulls, 0, sizeof(nulls));
+
+	for (i = 0; i < ANON_MAPPINGS; i++)
+	{
+		nulls[0] = true;
+		values[1] = CStringGetTextDatum(MappingName(i));
+		values[2] = Int64GetDatum(Segments[i].ShmemSegHdr->freeoffset);
+		values[3] = Int64GetDatum(Segments[i].ShmemSegHdr->totalsize - Segments[i].ShmemSegHdr->freeoffset);
+		values[4] = values[3];
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
 
 	LWLockRelease(ShmemIndexLock);
 
@@ -825,3 +836,46 @@ pg_numa_available(PG_FUNCTION_ARGS)
 {
 	PG_RETURN_BOOL(pg_numa_init() != -1);
 }
+
+/* SQL SRF showing shared memory segments */
+Datum
+pg_get_shmem_segments(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_SEGS_COLS 6
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum		values[PG_GET_SHMEM_SEGS_COLS];
+	bool		nulls[PG_GET_SHMEM_SEGS_COLS];
+	int i;
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	/* output all allocated entries */
+	for (i = 0; i < ANON_MAPPINGS; i++)
+	{
+		PGShmemHeader *shmhdr = Segments[i].ShmemSegHdr;
+		AnonymousMapping *segmapping = &Mappings[i];
+		int j;
+
+		if (shmhdr == NULL)
+		{
+			for (j = 0; j < PG_GET_SHMEM_SEGS_COLS; j++)
+				nulls[j] = true;
+		}
+		else
+		{
+			memset(nulls, 0, sizeof(nulls));
+			values[0] = Int32GetDatum(i);
+			values[1] = CStringGetTextDatum(MappingName(i));
+			values[2] = Int64GetDatum(shmhdr->totalsize);
+			values[3] = Int64GetDatum(shmhdr->freeoffset);
+			values[4] = Int64GetDatum(segmapping->shmem_size);
+			values[5] = Int64GetDatum(segmapping->shmem_reserved);
+		}
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+
+	return (Datum) 0;
+}
+
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 1e53b7a4ae5..6c37fa90c89 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8568,8 +8568,8 @@
 { oid => '5052', descr => 'allocations from the main shared memory segment',
   proname => 'pg_get_shmem_allocations', prorows => '50', proretset => 't',
   provolatile => 'v', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,int8,int8,int8}', proargmodes => '{o,o,o,o}',
-  proargnames => '{name,off,size,allocated_size}',
+  proallargtypes => '{text,text,int8,int8,int8}', proargmodes => '{o,o,o,o,o}',
+  proargnames => '{name,segment,off,size,allocated_size}',
   prosrc => 'pg_get_shmem_allocations' },
 
 { oid => '4099', descr => 'Is NUMA support available?',
@@ -8592,6 +8592,14 @@
   proargmodes => '{o,o,o}', proargnames => '{name,type,size}',
   prosrc => 'pg_get_dsm_registry_allocations' },
 
+# shared memory segments 
+{ oid => '5101', descr => 'shared memory segments',
+  proname => 'pg_get_shmem_segments', prorows => '6', proretset => 't',
+  provolatile => 'v', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{int4,text,int8,int8,int8,int8}', proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{id,name,size,freeoffset,mapping_size,mapping_reserved_size}',
+  prosrc => 'pg_get_shmem_segments' },
+
 # buffer lookup table
 { oid => '5102',
   descr => 'shared buffer lookup table',
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index a1fa6b43fe3..715f6acb5dd 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -69,7 +69,6 @@ typedef struct ShmemSegment
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
 
-
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 910c43f54f4..64ff5a286ba 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -71,6 +71,7 @@ typedef struct
 	void	   *location;		/* location in shared mem */
 	Size		size;			/* # bytes requested for the structure */
 	Size		allocated_size; /* # bytes actually allocated */
+	int			shmem_segment;	/* segment in which the structure is allocated */
 } ShmemIndexEnt;
 
 #endif							/* SHMEM_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 760bb13fe95..e73314b5ef0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1764,14 +1764,22 @@ pg_shadow| SELECT pg_authid.rolname AS usename,
      LEFT JOIN pg_db_role_setting s ON (((pg_authid.oid = s.setrole) AND (s.setdatabase = (0)::oid))))
   WHERE pg_authid.rolcanlogin;
 pg_shmem_allocations| SELECT name,
+    segment,
     off,
     size,
     allocated_size
-   FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+   FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, segment, off, size, allocated_size);
 pg_shmem_allocations_numa| SELECT name,
     numa_node,
     size
    FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, numa_node, size);
+pg_shmem_segments| SELECT id,
+    name,
+    size,
+    freeoffset,
+    mapping_size,
+    mapping_reserved_size
+   FROM pg_get_shmem_segments() pg_get_shmem_segments(id, name, size, freeoffset, mapping_size, mapping_reserved_size);
 pg_stat_activity| SELECT s.datid,
     d.datname,
     s.pid,
-- 
2.34.1

0013-Update-sizes-and-addresses-of-shared-memory-20250918.patchapplication/x-patch; name=0013-Update-sizes-and-addresses-of-shared-memory-20250918.patchDownload
From 7cdcf605c4d67aa35f66e42c98a12c1b97c20b69 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 21 Aug 2025 15:44:24 +0530
Subject: [PATCH 13/16] Update sizes and addresses of shared memory mapping and
 shared memory structures

Update totalsize and end address in segment and mapping: once a shared
memory segment has been resized, its total size and end address need to be
updated in the corresponding AnonymousMapping and ShmemSegment structures.

Update allocated_size for the resized shared memory structure: properly
reallocating the shared memory structure after a resize needs a bit more
work, but at least update allocated_size along with the size of the shared
memory structure.

Author: Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c   | 4 ++++
 src/backend/storage/ipc/shmem.c | 6 +++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index ba8613678f6..54d335b2e5d 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1021,6 +1021,8 @@ AnonymousShmemResize(void)
 	for(int i = 0; i < ANON_MAPPINGS; i++)
 	{
 		AnonymousMapping *m = &Mappings[i];
+		ShmemSegment *segment = &Segments[i];
+		PGShmemHeader *shmem_hdr = segment->ShmemSegHdr;
 
 #ifdef MAP_HUGETLB
 		if (huge_pages_on && (m->shmem_req_size % hugepagesize != 0))
@@ -1067,6 +1069,8 @@ AnonymousShmemResize(void)
 
 		reinit = true;
 		m->shmem_size = m->shmem_req_size;
+		shmem_hdr->totalsize = m->shmem_size;
+		segment->ShmemEnd = m->shmem + m->shmem_size;
 	}
 
 	if (reinit)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 2a197540300..0f9abf69fd5 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -504,13 +504,17 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 		 *
 		 * XXX: There is an implicit assumption this can only happen in
 		 * "resizable" segments, where only one shared structure is allowed.
-		 * This has to be implemented more cleanly.
+		 * This has to be implemented more cleanly. Probably we should implement
+		 * ShmemReallocRawInSegment functionality just to adjust the size
+		 * according to alignment, return the allocated size and update the
+		 * mapping offset.
 		 */
 		if (result->size != size)
 		{
 			Size delta = size - result->size;
 
 			result->size = size;
+			result->allocated_size = size;
 
 			/* Reflect size change in the shared segment */
 			SpinLockAcquire(Segments[shmem_segment].ShmemLock);
-- 
2.34.1

0011-Allow-to-resize-shared-memory-without-resta-20250918.patchapplication/x-patch; name=0011-Allow-to-resize-shared-memory-without-resta-20250918.patchDownload
From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 14:16:55 +0200
Subject: [PATCH 11/16] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory, using the
space introduced in the previous commits, without requiring a PostgreSQL
restart. Essentially the implementation is based on two mechanisms: a
ProcSignalBarrier is used to make sure all processes start the resize
procedure simultaneously, and a global Barrier is used to coordinate after
that and make sure all processes that have finished wait for the others
that are still in progress.

The resize process looks like this:

* The GUC assign hook sets a flag to let the postmaster know that a resize
  was requested.

* The postmaster checks the flag in its event loop and starts the resize
  by emitting a ProcSignal barrier.

* All processes that participate in the ProcSignal mechanism begin to
  process the ProcSignal barrier. First each process waits until all
  processes have confirmed they received the message, so that they can
  start simultaneously.

* Every process recalculates the shared memory size based on the new
  NBuffers, adjusts the segment size using ftruncate and adjusts the
  reservation permissions with mprotect (a standalone sketch of this
  mechanism follows the description below). One elected process signals
  the postmaster to do the same.

* When finished, every process waits on a global ShmemControl barrier
  until all others are finished as well. This way we ensure three stages
  with clear boundaries: before the resize, when all processes use the
  old NBuffers; during the resize, when processes have a mix of old and
  new NBuffers and wait until it's done; after the resize, when all
  processes use the new NBuffers.

* After all processes are using the new value, one of them initializes the
  new shared structures (buffer blocks, descriptors, etc.) as needed and
  broadcasts the new value of NBuffers via ShmemControl in shared memory.
  Other backends wait for this operation to finish as well. Then the
  barrier is lifted and everything proceeds as usual.

Since resizing takes time, we need to take into account that during that time:

- New backends can be spawned. They will check the status of the barrier
  early during bootstrap, and wait until everything is over to work with
  the new NBuffers value.

- Old backends can exit before attempting to resize. The synchronization
  used between backends relies on the ProcSignalBarrier and waits at the
  beginning until all participants have received the message, to gather
  all existing backends.

- Some backends might be blocked and not responding either before or
  after receiving the message. In the first case such a backend still
  has a ProcSignalSlot and should be waited for; in the second case the
  shared barrier makes sure we keep waiting for those backends. In
  either case there is an unbounded wait.

- Backends might join the barrier in disjoint groups with some time in
  between. That means that relying only on the shared dynamic barrier is
  not enough -- it would only synchronize the resize procedure within
  those groups. That's why we first wait for all participants of the
  ProcSignal mechanism that have received the message.

Here is how it looks after raising shared_buffers from 128 MB to
512 MB and calling pg_reload_conf():

    -- 128 MB
    7f87909fc000-7f8798248000 rw-s /memfd:strategy (deleted)
    7f8798248000-7f879d6ca000 ---s /memfd:strategy (deleted)
    7f879d6ca000-7f87a4e84000 rw-s /memfd:checkpoint (deleted)
    7f87a4e84000-7f87aa398000 ---s /memfd:checkpoint (deleted)
    7f87aa398000-7f87b1b42000 rw-s /memfd:iocv (deleted)
    7f87b1b42000-7f87c3d32000 ---s /memfd:iocv (deleted)
    7f87c3d32000-7f87cb59c000 rw-s /memfd:descriptors (deleted)
    7f87cb59c000-7f87dd6cc000 ---s /memfd:descriptors (deleted)
    7f87dd6cc000-7f87ece38000 rw-s /memfd:buffers (deleted)
    ^ buffers content, ~247 MB
    7f87ece38000-7f8877066000 ---s /memfd:buffers (deleted)
    ^ reserved space, ~2210 MB
    7f8877066000-7f887e7d0000 rw-s /memfd:main (deleted)
    7f887e7d0000-7f8890a00000 ---s /memfd:main (deleted)

    -- 512 MB
    7f87909fc000-7f879866a000 rw-s /memfd:strategy (deleted)
    7f879866a000-7f879d6ca000 ---s /memfd:strategy (deleted)
    7f879d6ca000-7f87a50f4000 rw-s /memfd:checkpoint (deleted)
    7f87a50f4000-7f87aa398000 ---s /memfd:checkpoint (deleted)
    7f87aa398000-7f87b1d82000 rw-s /memfd:iocv (deleted)
    7f87b1d82000-7f87c3d32000 ---s /memfd:iocv (deleted)
    7f87c3d32000-7f87cba1c000 rw-s /memfd:descriptors (deleted)
    7f87cba1c000-7f87dd6cc000 ---s /memfd:descriptors (deleted)
    7f87dd6cc000-7f8804fb8000 rw-s /memfd:buffers (deleted)
    ^ buffers content, ~632 MB
    7f8804fb8000-7f8877066000 ---s /memfd:buffers (deleted)
    ^ reserved space, ~1824 MB
    7f8877066000-7f887e950000 rw-s /memfd:main (deleted)
    7f887e950000-7f8890a00000 ---s /memfd:main (deleted)

The implementation supports only increasing shared_buffers. For decreasing
the value a similar procedure is needed, but the buffer blocks containing
data have to be drained first, so that the actual data set fits into the
new, smaller space.

Experiments show that shared mappings have to be extended separately in
each process that uses them. Another rough edge is that a backend blocked
on ReadCommand will not apply the shared_buffers change until it receives
something.
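
To illustrate the mechanism in isolation (this is not part of the patch, and
all names and sizes below are made up for the example), here is a minimal
standalone Linux-only C sketch that reserves a large mapping up front and
later grows the accessible part in place with ftruncate and mprotect, which is
the same trick the resize code relies on:

    /* resize_sketch.c -- illustrative only, assumes Linux and glibc >= 2.27 */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t  reserved = (size_t) 1 << 30;    /* 1 GB of reserved address space */
        size_t  initial = (size_t) 128 << 20;   /* 128 MB backed initially */
        size_t  grown = (size_t) 512 << 20;     /* target size after "resize" */
        char   *p;
        int     fd;

        /* Anonymous file that backs the mapping */
        fd = memfd_create("buffers", 0);
        if (fd < 0) { perror("memfd_create"); return 1; }

        /* Size the backing file for the initial allocation only */
        if (ftruncate(fd, initial) < 0) { perror("ftruncate"); return 1; }

        /* Reserve the whole range up front; nothing is accessible yet */
        p = mmap(NULL, reserved, PROT_NONE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Open up the initial prefix and touch it */
        if (mprotect(p, initial, PROT_READ | PROT_WRITE) < 0) { perror("mprotect"); return 1; }
        memset(p, 0, initial);

        /* "Resize": grow the file, then widen the accessible prefix in place */
        if (ftruncate(fd, grown) < 0) { perror("ftruncate"); return 1; }
        if (mprotect(p, grown, PROT_READ | PROT_WRITE) < 0) { perror("mprotect"); return 1; }
        memset(p, 0, grown);

        printf("mapping at %p grew from %zu to %zu bytes\n", (void *) p, initial, grown);
        return 0;
    }

The address returned by mmap never changes; only the backing file size and the
page protections do, which is why pointers into a resized segment stay valid.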

Authors: Dmitrii Dolgov, Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 443 ++++++++++++++++++
 src/backend/postmaster/checkpointer.c         |  12 +-
 src/backend/postmaster/postmaster.c           |  18 +
 src/backend/storage/buffer/buf_init.c         |  60 ++-
 src/backend/storage/ipc/ipci.c                |  15 +-
 src/backend/storage/ipc/procsignal.c          |  46 ++
 src/backend/storage/ipc/shmem.c               |  23 +-
 src/backend/tcop/postgres.c                   |  10 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/backend/utils/misc/guc_parameters.dat     |   3 +-
 src/include/storage/bufmgr.h                  |   2 +-
 src/include/storage/ipc.h                     |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pg_shmem.h                |  26 +
 src/include/storage/pmsignal.h                |   1 +
 src/include/storage/procsignal.h              |   1 +
 src/tools/pgindent/typedefs.list              |   1 +
 17 files changed, 631 insertions(+), 37 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index b85911bdfc4..dc4eeeee56a 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,13 +30,19 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/pmsignal.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/pidfile.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -96,6 +102,13 @@ void	   *UsedShmemSegAddr = NULL;
 
 AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/* Flag telling postmaster that resize is needed */
+volatile bool pending_pm_shmem_resize = false;
+
+/* Keeps track of the previous NBuffers value */
+static int NBuffersOld = -1;
+static int NBuffersPending = -1;
+
 /*
  * Anonymous mapping layout we use looks like this:
  *
@@ -147,6 +160,49 @@ static double SHMEM_RESIZE_RATIO[6] = {
  */
 static bool huge_pages_on = false;
 
+/*
+ * Flag telling whether we have prepared the memory layout to be resizable. If
+ * false after all shared memory segments have been created, it means we failed
+ * to set up the needed layout and fell back to the regular non-resizable approach.
+ */
+static bool shmem_resizable = false;
+
+/*
+ * Currently broadcasted value of NBuffers in shared memory.
+ *
+ * Most of the time this value is going to be equal to NBuffers. But if
+ * postmaster is resizing shared memory and a new backend was created
+ * at the same time, there is a possibility for the new backend to inherit the
+ * old NBuffers value, but miss the resize signal if ProcSignal infrastructure
+ * was not initialized yet. Consider this situation:
+ *
+ *     Postmaster ------> New Backend
+ *         |                   |
+ *         |                Launch
+ *         |                   |
+ *         |             Inherit NBuffers
+ *         |                   |
+ *     Resize NBuffers         |
+ *         |                   |
+ *     Emit Barrier            |
+ *         |            Init ProcSignal
+ *         |                   |
+ *     Finish resize           |
+ *         |                   |
+ *     New NBuffers       Old NBuffers
+ *
+ * In this case the backend is not yet ready to receive a signal from
+ * EmitProcSignalBarrier, and will be ignored. The same happens if ProcSignal
+ * is initialized even later, after the resizing was finished.
+ *
+ * To address the resulting inconsistency, the postmaster broadcasts the current
+ * NBuffers value via shared memory. Every new backend has to verify this value
+ * before it accesses the buffer pool: if it differs from its own value, a
+ * shared memory resize has happened and the backend has to synchronize with
+ * the rest of the pack first.
+ */
+ShmemControl *ShmemCtrl = NULL;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -906,6 +962,346 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * Resize all shared memory segments based on the current NBuffers value, which
+ * is applied from NBuffersPending. The actual segment resizing is done via
+ * ftruncate, which will fail if there is not sufficient space to expand the
+ * anon file. When finished, initialize new buffer blocks, if any, based on the
+ * new and old values.
+ *
+ * If resizing took place, as the last step this function also reinitializes
+ * the buffers and broadcasts the new value of NSharedBuffers. All of that
+ * needs to be done by only one backend, the first one that manages to grab
+ * the ShmemResizeLock.
+ */
+bool
+AnonymousShmemResize(void)
+{
+	int		numSemas;
+	bool 	reinit = false;
+	int		mmap_flags = PG_MMAP_FLAGS;
+	Size 	hugepagesize;
+
+	NBuffers = NBuffersPending;
+
+	elog(DEBUG1, "Resize shmem from %d to %d", NBuffersOld, NBuffers);
+
+	/*
+	 * XXX: Where to reset the flag is still an open question. E.g. do we
+	 * consider a no-op when NBuffers is equal to NBuffersOld a genuine resize
+	 * and reset the flag?
+	 */
+	pending_pm_shmem_resize = false;
+
+	/*
+	 * XXX: Currently only increasing of shared_buffers is supported. For
+	 * decreasing something similar has to be done, but buffer blocks with
+	 * data have to be drained first.
+	 */
+	if(NBuffersOld > NBuffers)
+		return false;
+
+#ifndef MAP_HUGETLB
+	/* PrepareHugePages should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
+#else
+	if (huge_pages_on)
+	{
+		/* Make sure nothing is messed up */
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+		/* Round up the new size to a suitable large value */
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+	}
+#endif
+
+	/* Note that CalculateShmemSize indirectly depends on NBuffers */
+	CalculateShmemSize(&numSemas);
+
+	for(int i = 0; i < ANON_MAPPINGS; i++)
+	{
+		AnonymousMapping *m = &Mappings[i];
+
+#ifdef MAP_HUGETLB
+		if (huge_pages_on && (m->shmem_req_size % hugepagesize != 0))
+			m->shmem_req_size += hugepagesize - (m->shmem_req_size % hugepagesize);
+#endif
+
+		if (m->shmem == NULL)
+			continue;
+
+		if (m->shmem_size == m->shmem_req_size)
+			continue;
+
+		if (m->shmem_reserved < m->shmem_req_size)
+			ereport(ERROR,
+					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					 errmsg("not enough shared memory is reserved"),
+					 errhint("You may need to increase \"max_available_memory\".")));
+
+		elog(DEBUG1, "segment[%s]: resize from %zu to %zu at address %p",
+					 MappingName(m->shmem_segment), m->shmem_size,
+					 m->shmem_req_size, m->shmem);
+
+		/* Resize the backing anon file. */
+		if(ftruncate(m->segment_fd, m->shmem_req_size) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not truncate anonymous file for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
+		/* Adjust memory accessibility */
+		if(mprotect(m->shmem, m->shmem_req_size, PROT_READ | PROT_WRITE) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not mprotect anonymous shared memory for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
+		/* If shrinking, make reserved space unavailable again */
+		if(m->shmem_req_size < m->shmem_size &&
+		   mprotect(m->shmem + m->shmem_req_size, m->shmem_size - m->shmem_req_size, PROT_NONE) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not mprotect reserved shared memory for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
+		reinit = true;
+		m->shmem_size = m->shmem_req_size;
+	}
+
+	if (reinit)
+	{
+		if(IsUnderPostmaster &&
+			LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+		{
+			/*
+			 * If the new NBuffers was already broadcasted, the buffer pool was
+			 * already initialized before.
+			 *
+			 * Since we're not on a hot path, we use lwlocks and do not need to
+			 * involve memory barrier.
+			 */
+			if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+			{
+				/*
+				 * Allow the first backend that managed to get the lock to
+				 * reinitialize the new portion of buffer pool. Every other
+				 * process will wait on the shared barrier for that to finish,
+				 * since it's a part of the SHMEM_RESIZE_DONE phase.
+				 *
+				 * Note that it's enough for only one backend to do that, even
+				 * the ShmemInitStruct part. The reason is that the resized
+				 * shared memory maintains the same addresses, meaning all
+				 * pointers are still valid, and we only need to update the
+				 * structure sizes in the ShmemIndex once -- any other backend
+				 * will pick up this shared structure from the index.
+				 *
+				 * XXX: This is the right place for buffer eviction as well.
+				 */
+				BufferManagerShmemInit(NBuffersOld);
+
+				/* If all fine, broadcast the new value */
+				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+			}
+
+			LWLockRelease(ShmemResizeLock);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * We are asked to resize shared memory. Wait for all ProcSignal participants
+ * to join the barrier, then do the resize and wait on the barrier until all
+ * participants have finished resizing as well -- otherwise we risk
+ * inconsistency between backends.
+ *
+ * XXX: If a backend is blocked on ReadCommand in PostgresMain, it will not
+ * proceed with AnonymousShmemResize after receiving SIGHUP until something
+ * is sent.
+ */
+bool
+ProcessBarrierShmemResize(Barrier *barrier)
+{
+	Assert(IsUnderPostmaster);
+
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+
+	/* Wait until we have seen the new NBuffers value */
+	if (!pending_pm_shmem_resize)
+		return false;
+
+	/*
+	 * The first thing to do after attaching to the barrier is to wait for
+	 * others. We can't simply use BarrierArriveAndWait, because backends might
+	 * arrive here in disjoint groups, e.g. first two backends, a pause, then
+	 * another two backends. If the resize is quick enough, that can lead to a
+	 * situation where the first group has already finished before the second
+	 * has appeared, and the barrier will only synchronize within those groups.
+	 */
+	if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
+		WaitForProcSignalBarrierReceived(
+				pg_atomic_read_u64(&ShmemCtrl->Generation));
+
+	/*
+	 * Now start the procedure, and elect one backend to ping postmaster to do
+	 * the same.
+	 *
+	 * XXX: If we need to be able to abort resizing, this has to be done later,
+	 * after the SHMEM_RESIZE_DONE.
+	 */
+	if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+	{
+		Assert(IsUnderPostmaster);
+		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
+	}
+
+	AnonymousShmemResize();
+
+	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
+	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
+
+	BarrierDetach(barrier);
+	return true;
+}
+
+/*
+ * GUC assign hook for shared_buffers. It's recommended for an assign hook to
+ * be as minimal as possible, thus we just request shared memory resize and
+ * remember the previous value.
+ */
+void
+assign_shared_buffers(int newval, void *extra, bool *pending)
+{
+	elog(DEBUG1, "Received SIGHUP for shmem resizing");
+
+	pending_pm_shmem_resize = true;
+	*pending = true;
+	NBuffersPending = newval;
+
+	NBuffersOld = NBuffers;
+}
+
+/*
+ * Test if we have somehow missed a shmem resize signal and the NBuffers value
+ * differs from NSharedBuffers. If so, catch up and do the resize.
+ */
+void
+AdjustShmemSize(void)
+{
+	uint32 NSharedBuffers = pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);
+
+	if (NSharedBuffers != NBuffers)
+	{
+		/*
+		 * If the broadcasted shared_buffers is different from the one we see,
+		 * any inconsistency, adjust the shared mappings before we get a
+		 * chance to access the buffer pool.
+		 * chance to access the buffer pool.
+		 */
+		ereport(LOG,
+				(errmsg("shared_buffers has been changed from %d to %d, "
+						"resize shared memory",
+						NBuffers, NSharedBuffers)));
+		NBuffers = NSharedBuffers;
+		AnonymousShmemResize();
+	}
+}
+
+/*
+ * Start resizing procedure, making sure all existing processes will have
+ * consistent view of shared memory size. Must be called only in postmaster.
+ */
+void
+CoordinateShmemResize(void)
+{
+	elog(DEBUG1, "Coordinating shmem resize from %d to %d",
+		 NBuffersOld, NBuffers);
+	Assert(!IsUnderPostmaster);
+
+	/*
+	 * We use a dynamic barrier to help deal with backends that were spawned
+	 * during the resize.
+	 */
+	BarrierInit(&ShmemCtrl->Barrier, 0);
+
+	/*
+	 * If the value did not change, or shared memory segments are not
+	 * initialized yet, skip the resize.
+	 */
+	if (NBuffersPending == NBuffersOld)
+	{
+		elog(DEBUG1, "Skip resizing, new %d, old %d",
+			 NBuffers, NBuffersOld);
+		return;
+	}
+
+	/*
+	 * Shared memory resize requires some coordination done by the postmaster,
+	 * and consists of three phases:
+	 *
+	 * - Before the resize all existing backends have the same old NBuffers.
+	 * - While the resize is in progress, backends are expected to have a
+	 *   mixture of old and new values. They're not allowed to touch the
+	 *   buffer pool during this time frame.
+	 * - After the resize has finished, all existing backends that can access
+	 *   the buffer pool are expected to have the same new value of NBuffers.
+	 *
+	 * Those phases are ensured by joining the shared barrier associated with
+	 * the procedure. Since resizing takes time, we need to take into account
+	 * that during that time:
+	 *
+	 * - New backends can be spawned. They will check the status of the barrier
+	 *   early during bootstrap, and wait until everything is over to work
+	 *   with the new NBuffers value.
+	 *
+	 * - Old backends can exit before attempting to resize. The synchronization
+	 *   used between backends relies on the ProcSignalBarrier and waits at the
+	 *   beginning until all participants have received the message, to gather
+	 *   all existing backends.
+	 *
+	 * - Some backends might be blocked and not responding either before or
+	 *   after receiving the message. In the first case such a backend still
+	 *   has a ProcSignalSlot and should be waited for; in the second case the
+	 *   shared barrier makes sure we keep waiting for those backends. In
+	 *   either case there is an unbounded wait.
+	 *
+	 * - Backends might join the barrier in disjoint groups with some time in
+	 *   between. That means that relying only on the shared dynamic barrier is
+	 *   not enough -- it would only synchronize the resize procedure within
+	 *   those groups. That's why we first wait for all participants of the
+	 *   ProcSignal mechanism that have received the message.
+	 */
+	elog(DEBUG1, "Emit a barrier for shmem resizing");
+	pg_atomic_init_u64(&ShmemCtrl->Generation,
+					   EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE));
+
+	/* To order everything after setting Generation value */
+	pg_memory_barrier();
+
+	/*
+	 * After that postmaster waits for PMSIGNAL_SHMEM_RESIZE as a sign that all
+	 * the rest of the pack has started the procedure and it can resize shared
+	 * memory as well.
+	 *
+	 * Normally we would call WaitForProcSignalBarrier here to wait until every
+	 * backend has reported on the ProcSignalBarrier. But for shared memory
+	 * resize we don't need this, as every participating backend will
+	 * synchronize on the ProcSignal barrier. In fact, even if we wanted to
+	 * wait here, it wouldn't be possible -- we're in the postmaster, without
+	 * any waiting infrastructure available.
+	 *
+	 * If at some point it turns out that waiting is essential, we would
+	 * need to consider some alternatives. E.g. it could be a designated
+	 * coordination process, which is not the postmaster. Another option would be
+	 * to introduce a CoordinateShmemResize lock and allow only one process to
+	 * take it (this probably would have to be something different than
+	 * LWLocks, since they block interrupts, and coordination relies on them).
+	 */
+}
+
 /*
  * PGSharedMemoryCreate
  *
@@ -1217,3 +1613,50 @@ PGSharedMemoryDetach(void)
 		}
 	}
 }
+
+void
+WaitOnShmemBarrier()
+{
+	Barrier *barrier = &ShmemCtrl->Barrier;
+
+	/* Nothing to do if resizing is not started */
+	if (BarrierPhase(barrier) < SHMEM_RESIZE_START)
+		return;
+
+	BarrierAttach(barrier);
+
+	/* Otherwise wait through all available phases */
+	while (BarrierPhase(barrier) < SHMEM_RESIZE_DONE)
+	{
+		ereport(LOG, (errmsg("ProcSignal barrier is in phase %d, waiting",
+							 BarrierPhase(barrier))));
+
+		BarrierArriveAndWait(barrier, 0);
+	}
+
+	BarrierDetach(barrier);
+}
+
+void
+ShmemControlInit(void)
+{
+	bool foundShmemCtrl;
+
+	ShmemCtrl = (ShmemControl *)
+	ShmemInitStruct("Shmem Control", sizeof(ShmemControl),
+									 &foundShmemCtrl);
+
+	if (!foundShmemCtrl)
+	{
+		/*
+		 * The barrier is missing here, it will be initialized right before
+		 * starting the resizing process as a convenient way to reset it.
+		 */
+
+		/* Initialize with the currently known value */
+		pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+
+		/* shmem_resizable should be initialized by now */
+		ShmemCtrl->Resizable = shmem_resizable;
+	}
+}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e84e8663e96..ef3f84a55f5 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -654,9 +654,12 @@ CheckpointerMain(const void *startup_data, size_t startup_data_len)
 static void
 ProcessCheckpointerInterrupts(void)
 {
-	if (ProcSignalBarrierPending)
-		ProcessProcSignalBarrier();
-
+	/*
+	 * Reloading the config file can trigger further signals, complicating
+	 * interrupt processing -- so let it run first.
+	 *
+	 * XXX: Is there any need for a memory barrier after ProcessConfigFile?
+	 */
 	if (ConfigReloadPending)
 	{
 		ConfigReloadPending = false;
@@ -676,6 +679,9 @@ ProcessCheckpointerInterrupts(void)
 		UpdateSharedMemoryConfig();
 	}
 
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
 		ProcessLogMemoryContextInterrupt();
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b59d20b4ac2..ba9528d5dfa 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -426,6 +426,7 @@ static void process_pm_pmsignal(void);
 static void process_pm_child_exit(void);
 static void process_pm_reload_request(void);
 static void process_pm_shutdown_request(void);
+static void process_pm_shmem_resize(void);
 static void dummy_handler(SIGNAL_ARGS);
 static void CleanupBackend(PMChild *bp, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
@@ -1697,6 +1698,9 @@ ServerLoop(void)
 			if (pending_pm_pmsignal)
 				process_pm_pmsignal();
 
+			if (pending_pm_shmem_resize)
+				process_pm_shmem_resize();
+
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
 				ClientSocket s;
@@ -2042,6 +2046,17 @@ process_pm_reload_request(void)
 	}
 }
 
+static void
+process_pm_shmem_resize(void)
+{
+	/*
+	 * Failure to resize is considered to be fatal and will not be
+	 * retried, which means we can disable pending flag right here.
+	 */
+	pending_pm_shmem_resize = false;
+	CoordinateShmemResize();
+}
+
 /*
  * pg_ctl uses SIGTERM, SIGINT and SIGQUIT to request different types of
  * shutdown.
@@ -3862,6 +3877,9 @@ process_pm_pmsignal(void)
 		request_state_update = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_SHMEM_RESIZE))
+		AnonymousShmemResize();
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
 	 */
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6f148d1d80b..0e72e373193 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -18,6 +18,7 @@
 #include "storage/buf_internals.h"
 #include "storage/pg_shmem.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -63,18 +64,28 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend). Size of data structures initialized
- * here depends on NBuffers, and to be able to change NBuffers without a
- * restart we store each structure into a separate shared memory segment, which
- * could be resized on demand.
+ * postmaster, or in a standalone backend) or during shared-memory resize. Size
+ * of data structures initialized here depends on NBuffers, and to be able to
+ * change NBuffers without a restart we store each structure into a separate
+ * shared memory segment, which could be resized on demand.
+ *
+ * FirstBufferToInit tells where to start initializing buffers. For the
+ * initial creation it will always be zero; when resizing shared memory it
+ * indicates the number of already-initialized buffers.
+ *
+ * No locks are taken in this function; it is the caller's responsibility to
+ * make sure only one backend works with the new buffers.
  */
 void
-BufferManagerShmemInit(void)
+BufferManagerShmemInit(int FirstBufferToInit)
 {
 	bool		foundBufs,
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
+	int			i;
+	elog(DEBUG1, "BufferManagerShmemInit from %d to %d",
+				 FirstBufferToInit, NBuffers);
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
@@ -111,34 +122,35 @@ BufferManagerShmemInit(void)
 	{
 		/* should find all of these, or none of them */
 		Assert(foundDescs && foundBufs && foundIOCV && foundBufCkpt);
-		/* note: this path is only taken in EXEC_BACKEND case */
-	}
-	else
-	{
-		int			i;
-
 		/*
-		 * Initialize all the buffer headers.
+		 * note: this path is only taken in EXEC_BACKEND case when initializing
+		 * shared memory, or in all cases when resizing shared memory.
 		 */
-		for (i = 0; i < NBuffers; i++)
-		{
-			BufferDesc *buf = GetBufferDescriptor(i);
+	}
+
+#ifndef EXEC_BACKEND
+	/*
+	 * Initialize all the buffer headers.
+	 */
+	for (i = FirstBufferToInit; i < NBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
 
-			ClearBufferTag(&buf->tag);
+		ClearBufferTag(&buf->tag);
 
-			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
 
-			buf->buf_id = i;
+		buf->buf_id = i;
 
-			pgaio_wref_clear(&buf->io_wref);
+		pgaio_wref_clear(&buf->io_wref);
 
-			LWLockInitialize(BufferDescriptorGetContentLock(buf),
-							 LWTRANCHE_BUFFER_CONTENT);
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
 
-			ConditionVariableInit(BufferDescriptorGetIOCV(buf));
-		}
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
 	}
+#endif
 
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2cd278449f0..bd75f06047e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -171,6 +171,14 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how the slots are split, or was there a hidden
+	 * inconsistency in the shmem size calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
@@ -333,7 +341,7 @@ CreateOrAttachShmemStructs(void)
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
-	BufferManagerShmemInit();
+	BufferManagerShmemInit(0);
 
 	/*
 	 * Set up lock manager
@@ -345,6 +353,11 @@ CreateOrAttachShmemStructs(void)
 	 */
 	PredicateLockShmemInit();
 
+	/*
+	 * Set up shared memory resize manager
+	 */
+	ShmemControlInit();
+
 	/*
 	 * Set up process table
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index eb3ceaae809..2160d258fa7 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/pg_shmem.h"
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
@@ -113,6 +114,10 @@ static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
 
+#ifdef DEBUG_SHMEM_RESIZE
+bool delay_proc_signal_init = false;
+#endif
+
 /*
  * ProcSignalShmemSize
  *		Compute space needed for ProcSignal's shared memory
@@ -176,6 +181,43 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	uint32		old_pss_pid;
 
 	Assert(cancel_key_len >= 0 && cancel_key_len <= MAX_CANCEL_KEY_LENGTH);
+
+#ifdef DEBUG_SHMEM_RESIZE
+	/*
+	 * Introduced for debugging purposes. You can change the variable at
+	 * runtime using gdb, then start new backends with delayed ProcSignal
+	 * initialization. A simple pg_usleep won't work here due to the SIGHUP
+	 * interrupt needed for testing. Taken from pg_sleep.
+	 */
+	if (delay_proc_signal_init)
+	{
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+		float8		endtime = GetNowFloat() + 5;
+
+		for (;;)
+		{
+			float8		delay;
+			long		delay_ms;
+
+			CHECK_FOR_INTERRUPTS();
+
+			delay = endtime - GetNowFloat();
+			if (delay >= 600.0)
+				delay_ms = 600000;
+			else if (delay > 0.0)
+				delay_ms = (long) (delay * 1000.0);
+			else
+				break;
+
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 delay_ms,
+							 WAIT_EVENT_PG_SLEEP);
+			ResetLatch(MyLatch);
+		}
+	}
+#endif
+
 	if (MyProcNumber < 0)
 		elog(ERROR, "MyProcNumber not set");
 	if (MyProcNumber >= NumProcSignalSlots)
@@ -615,6 +657,10 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
+					case PROCSIGNAL_BARRIER_SHMEM_RESIZE:
+						processed = ProcessBarrierShmemResize(
+								&ShmemCtrl->Barrier);
+						break;
 				}
 
 				/*
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 9499f332e77..2a197540300 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -498,17 +498,26 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. Verify the structure's size:
+		 * - If it's the same, we've found the expected structure.
+		 * - If it's different, we're resizing the expected structure.
+		 *
+		 * XXX: There is an implicit assumption this can only happen in
+		 * "resizable" segments, where only one shared structure is allowed.
+		 * This has to be implemented more cleanly.
 		 */
 		if (result->size != size)
 		{
-			LWLockRelease(ShmemIndexLock);
-			ereport(ERROR,
-					(errmsg("ShmemIndex entry size is wrong for data structure"
-							" \"%s\": expected %zu, actual %zu",
-							name, size, result->size)));
+			Size delta = size - result->size;
+
+			result->size = size;
+
+			/* Reflect size change in the shared segment */
+			SpinLockAcquire(Segments[shmem_segment].ShmemLock);
+			Segments[shmem_segment].ShmemSegHdr->freeoffset += delta;
+			SpinLockRelease(Segments[shmem_segment].ShmemLock);
 		}
+
 		structPtr = result->location;
 	}
 	else
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index c819608fff6..15e9dde41d1 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -62,6 +62,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -4317,6 +4318,15 @@ PostgresMain(const char *dbname, const char *username)
 	 */
 	BeginReportingGUCOptions();
 
+	/* Verify the shared barrier, if it's still active: join and wait. */
+	WaitOnShmemBarrier();
+
+	/*
+	 * After waiting on the barrier above we are guaranteed to have
+	 * NSharedBuffers broadcast, so we can use it in the function below.
+	 */
+	AdjustShmemSize();
+
 	/*
 	 * Also set up handler to log session end; we have to wait till now to be
 	 * sure Log_disconnections has its final value.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..82cee6b8877 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -155,6 +155,8 @@ REPLICATION_ORIGIN_DROP	"Waiting for a replication origin to become inactive so
 REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
+SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
 WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for streaming replication."
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index c94f3fc3c80..5c534cee2ac 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1098,13 +1098,14 @@
 
 # We sometimes multiply the number of shared buffers by two without
 # checking for overflow, so we mustn't allow more than INT_MAX / 2.
-{ name => 'shared_buffers', type => 'int', context => 'PGC_POSTMASTER', group => 'RESOURCES_MEM',
+{ name => 'shared_buffers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_MEM',
   short_desc => 'Sets the number of shared memory buffers used by the server.',
   flags => 'GUC_UNIT_BLOCKS',
   variable => 'NBuffers',
   boot_val => '16384',
   min => '16',
   max => 'INT_MAX / 2',
+  assign_hook => 'assign_shared_buffers'
 },
 
 # TODO: should this be PGC_POSTMASTER?
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 47360a3d3d8..51ce6ebcf6c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -317,7 +317,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_skipped);
 
 /* in buf_init.c */
-extern void BufferManagerShmemInit(void);
+extern void BufferManagerShmemInit(int);
 extern Size BufferManagerShmemSize(void);
 
 /* in localbuf.c */
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 3baf418b3d1..847f56a36dc 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -64,6 +64,7 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
+extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
@@ -83,5 +84,7 @@ extern void CreateSharedMemoryAndSemaphores(void);
 extern void AttachSharedMemoryStructs(void);
 #endif
 extern void InitializeShmemGUCs(void);
+extern void CoordinateShmemResize(void);
+extern bool AnonymousShmemResize(void);
 
 #endif							/* IPC_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..cba586027a7 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, ShmemResize)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 715f6acb5dd..eba28ce8a5c 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,6 +24,7 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "storage/barrier.h"
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
@@ -69,6 +70,25 @@ typedef struct ShmemSegment
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/*
+ * ShmemControl is shared between backends and helps to coordinate shared
+ * memory resize.
+ */
+typedef struct
+{
+	pg_atomic_uint32 	NSharedBuffers;
+	Barrier 			Barrier;
+	pg_atomic_uint64 	Generation;
+	bool                Resizable;
+} ShmemControl;
+
+extern PGDLLIMPORT ShmemControl *ShmemCtrl;
+
+/* The phases for shared memory resizing, used by the ProcSignal barrier. */
+#define SHMEM_RESIZE_REQUESTED			0
+#define SHMEM_RESIZE_START				1
+#define SHMEM_RESIZE_DONE				2
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -123,6 +143,12 @@ extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 							int *memfd_flags);
 void PrepareHugePages(void);
 
+bool ProcessBarrierShmemResize(Barrier *barrier);
+void assign_shared_buffers(int newval, void *extra, bool *pending);
+void AdjustShmemSize(void);
+extern void WaitOnShmemBarrier(void);
+extern void ShmemControlInit(void);
+
 /*
  * To be able to dynamically resize largest parts of the data stored in shared
  * memory, we split it into multiple shared memory mappings segments. Each
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 428aa3fd68a..1a55bf57a70 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -42,6 +42,7 @@ typedef enum
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
 	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
+	PMSIGNAL_SHMEM_RESIZE,	/* resize shared memory */
 } PMSignalReason;
 
 #define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 2733bbb8c5b..97033f84dce 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,6 +54,7 @@ typedef enum
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
+	PROCSIGNAL_BARRIER_SHMEM_RESIZE, /* ask backends to resize shared memory */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e90af5b2ad3..ee5c2dd0ad4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2762,6 +2762,7 @@ ShellTypeInfo
 ShippableCacheEntry
 ShippableCacheKey
 ShmemIndexEnt
+ShmemControl
 ShutdownForeignScan_function
 ShutdownInformation
 ShutdownMode
-- 
2.34.1

0012-Initial-value-of-shared_buffers-or-NBuffers-20250918.patch
From cd2fd975382c629a5fa825a32a1b35e485564001 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 1 Sep 2025 15:40:41 +0530
Subject: [PATCH 12/16] Initial value of shared_buffers (or NBuffers)

The assign_hook for shared_buffers (assign_shared_buffers()) is called twice
during server startup. The first time it sets the default value of
shared_buffers; the second time it sets the value specified in the
configuration file or on the command line.  At those times the shared buffer
pool is yet to be initialized. Hence there is no need to keep the GUC change
pending or to go through the entire process of resizing memory maps,
reinitializing the shared memory and synchronizing processes. Instead the
given value should be assigned directly to NBuffers, which will be used when
creating the shared memory and when initializing the buffer pool the first
time.  Any change to shared_buffers after that will need remapping of the
shared memory segments and synchronized buffer pool reinitialization across
the backends.

If BufferBlocks is not initialized, assign_shared_buffers() assigns the given
value to NBuffers directly. Otherwise it marks the change as pending and
sets the flag pending_pm_shmem_resize so that the postmaster can start the
buffer pool reinitialization.

TODO:

1. The change depends upon the C guarantee that global pointer variables are
initialized to NULL. Maybe initialize BufferBlocks to NULL explicitly.

2. We might think of a better way to check whether the buffer pool has been
initialized. See the comment in assign_shared_buffers().

Author: Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c | 42 ++++++++++++++++++++++++++---------
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index dc4eeeee56a..ba8613678f6 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1168,20 +1168,42 @@ ProcessBarrierShmemResize(Barrier *barrier)
 }
 
 /*
- * GUC assign hook for shared_buffers. It's recommended for an assign hook to
- * be as minimal as possible, thus we just request shared memory resize and
- * remember the previous value.
+ * GUC assign hook for shared_buffers.
+ *
+ * When setting the GUC for the first time after starting the server, the value
+ * is applied immediately since there is no shared memory set up yet.
+ *
+ * After the shared memory is set up, changing the GUC value requires resizing
+ * and reinitializing (at least parts of) the shared memory structures related to
+ * shared buffers. That's a long and complicated process.  It's recommended for
+ * an assign hook to be as minimal as possible, thus we just request shared
+ * memory resize and remember the previous value.
  */
 void
 assign_shared_buffers(int newval, void *extra, bool *pending)
 {
-	elog(DEBUG1, "Received SIGHUP for shmem resizing");
-
-	pending_pm_shmem_resize = true;
-	*pending = true;
-	NBuffersPending = newval;
-
-	NBuffersOld = NBuffers;
+	/*
+	 * TODO: If a backend joins while the buffer resizing is in progress, or it
+	 * reads a value of shared_buffers from the configuration that is different
+	 * from the value being used by existing backends, this method may not work.
+	 * Need to think of a better solution.
+	 */
+	if (BufferBlocks)
+	{
+		elog(DEBUG1, "buffer pool is already initialized with size = %d, reinitializing it with size = %d",
+			NBuffers, newval);
+		pending_pm_shmem_resize = true;
+		*pending = true;
+		NBuffersPending = newval;
+		NBuffersOld = NBuffers;
+	}
+	else
+	{
+		elog(DEBUG1, "initializing buffer pool with size = %d", newval);
+		NBuffers = newval;
+		*pending = false;
+		pending_pm_shmem_resize = false;
+	}
 }
 
 /*
-- 
2.34.1

0014-Support-shrinking-shared-buffers-20250918.patch
From df1fbd464901d3097d9e74ae06a70bc629948d14 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 19 Jun 2025 17:38:29 +0200
Subject: [PATCH 14/16] Support shrinking shared buffers

Buffer eviction
===============
When shrinking the shared buffer pool, each buffer in the area being
shrunk needs to be flushed if it's dirty so as not to lose the changes
to that buffer after shrinking. Also, each such buffer needs to be
removed from the buffer mapping table so that backends do not access it
after shrinking.

Buffer eviction requires a separate barrier phase for two reasons:

1. No other backend should map a new page to any of the buffers being
   evicted while eviction is in progress, so they wait until eviction
   has finished.

2. Since a pinned buffer has the pin recorded in the backend's local
   memory as well as in the buffer descriptor (which is in shared memory),
   eviction should not coincide with remapping the shared memory of a
   backend. Otherwise we might lose consistency of the local and shared
   pinning records. Hence it needs to be carried out in
   ProcessBarrierShmemResize() and not in AnonymousShmemResize(), as
   indicated by the now-removed comment.

If a buffer being evicted is pinned, we raise a FATAL error, but this should
improve. There are multiple options: 1. wait for the pinned buffer to get
unpinned, 2. kill the backend or have it cancel its own query, or 3. roll
back the operation. Note that options 1 and 2 would require the pinning
related local and shared records to be accessed, but the infrastructure to
do either of those is not available right now.

Removing the evicted buffers from buffer ring
=============================================
If the buffer pool has been shrunk, the buffers in the buffer ring may
not be valid anymore. Modify GetBufferFromRing to check if the buffer is
still valid before using it. This makes GetBufferFromRing() a bit more
expensive because of an additional boolean condition, and it masks any
bug that introduces an invalid buffer into the ring. The alternative fix
is more complex, as explained below.

The strategy object is created in CurrentMemoryContext and is not
reachable through any global structure, so it is not accessible when
processing buffer resizing barriers. We may modify GetAccessStrategy()
to register the strategy in a global linked list and then arrange to
deregister it once it's no longer in use. Looking at the places which
use GetAccessStrategy(), fixing all of those may be some work.

Author: Ashutosh Bapat
Reviewed-by: Tomas Vondra
---
 src/backend/port/sysv_shmem.c                 | 42 ++++++---
 src/backend/storage/buffer/bufmgr.c           | 93 +++++++++++++++++++
 src/backend/storage/buffer/freelist.c         | 18 +++-
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/storage/bufmgr.h                  |  1 +
 src/include/storage/pg_shmem.h                |  1 +
 6 files changed, 139 insertions(+), 17 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 54d335b2e5d..9e1b2c3201f 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -993,14 +993,6 @@ AnonymousShmemResize(void)
 	 */
 	pending_pm_shmem_resize = false;
 
-	/*
-	 * XXX: Currently only increasing of shared_buffers is supported. For
-	 * decreasing something similar has to be done, but buffer blocks with
-	 * data have to be drained first.
-	 */
-	if(NBuffersOld > NBuffers)
-		return false;
-
 #ifndef MAP_HUGETLB
 	/* PrepareHugePages should have dealt with this case */
 	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
@@ -1099,11 +1091,14 @@ AnonymousShmemResize(void)
 				 * all the pointers are still valid, and we only need to update
 				 * structures size in the ShmemIndex once -- any other backend
 				 * will pick up this shared structure from the index.
-				 *
-				 * XXX: This is the right place for buffer eviction as well.
 				 */
 				BufferManagerShmemInit(NBuffersOld);
 
+				/*
+				 * Wipe out the evictor PID so that it can be used for the next
+				 * buffer resizing operation.
+				*/
+				ShmemCtrl->evictor_pid = 0;
 				/* If all fine, broadcast the new value */
 				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
 			}
@@ -1156,11 +1151,31 @@ ProcessBarrierShmemResize(Barrier *barrier)
 	 * XXX: If we need to be able to abort resizing, this has to be done later,
 	 * after the SHMEM_RESIZE_DONE.
 	 */
-	if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+
+	/*
+	 * Evict extra buffers when shrinking shared buffers. We need to do this
+	 * while the memory for extra buffers is still mapped i.e. before remapping
+	 * the shared memory segments to a smaller memory area.
+	 */
+	if (NBuffersOld > NBuffersPending)
 	{
-		Assert(IsUnderPostmaster);
-		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
+		BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);
+
+		/*
+		 * TODO: If the buffer eviction fails for any reason, we should
+		 * gracefully rollback the shared buffer resizing and try again. But the
+		 * infrastructure to do so is not available right now. Hence just raise
+		 * a FATAL so that the system restarts.
+		 */
+		if (!EvictExtraBuffers(NBuffersPending, NBuffersOld))
+			elog(FATAL, "buffer eviction failed");
+
+		if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_EVICT))
+			SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
 	}
+	else
+		if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+			SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
 
 	AnonymousShmemResize();
 
@@ -1684,5 +1699,6 @@ ShmemControlInit(void)
 
 		/* shmem_resizable should be initialized by now */
 		ShmemCtrl->Resizable = shmem_resizable;
+		ShmemCtrl->evictor_pid = 0;
 	}
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe470de63f2..5424c405b44 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/read_stream.h"
 #include "storage/smgr.h"
@@ -7422,3 +7423,95 @@ const PgAioHandleCallbacks aio_local_buffer_readv_cb = {
 	.complete_local = local_buffer_readv_complete,
 	.report = buffer_readv_report,
 };
+
+/*
+ * When shrinking the shared buffer pool, evict the buffers which will not be
+ * part of the shrunk buffer pool.
+ */
+bool
+EvictExtraBuffers(int newBufSize, int oldBufSize)
+{
+	bool result = true;
+
+	/*
+	 * If the buffer being evicted is locked, this function will need to wait.
+	 * It should not be called from the postmaster, since the postmaster cannot wait on a lock.
+	 */
+	Assert(IsUnderPostmaster);
+
+	/*
+	 * Let only one backend perform eviction. We could split the work across all
+	 * the backends but that doesn't seem necessary.
+	 *
+	 * The first backend to acquire ShmemResizeLock sets its own PID as the
+	 * evictor PID, so that other backends know that eviction is in progress or
+	 * has already been performed. The evictor backend releases the lock when it
+	 * finishes eviction.  While eviction is in progress, backends other than
+	 * the evictor backend won't be able to take the lock, so they won't perform
+	 * eviction. A backend may acquire the lock after eviction has completed, but
+	 * it will not perform eviction since the evictor PID is already set. The
+	 * evictor PID is reset only when the buffer resizing finishes. Thus only one
+	 * backend performs eviction in a given instance of shared buffers resizing.
+	 *
+	 * Any backend which acquires this lock will release it before the eviction
+	 * phase finishes, hence the same lock can be reused for the next phase of
+	 * resizing buffers.
+	 */
+	if (LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+	{
+		if (ShmemCtrl->evictor_pid == 0)
+		{
+			ShmemCtrl->evictor_pid = MyProcPid;
+
+			/*
+			 * TODO: Before evicting any buffer, we should check whether any of the
+			 * buffers are pinned. If we find that a buffer is pinned after evicting
+			 * most of them, that will impact performance since all those evicted
+			 * buffers might need to be read again.
+			 */
+			for (Buffer buf = newBufSize + 1; buf <= oldBufSize; buf++)
+			{
+				BufferDesc *desc = GetBufferDescriptor(buf - 1);
+				uint32		buf_state;
+				bool		buffer_flushed;
+
+				buf_state = pg_atomic_read_u32(&desc->state);
+
+				/*
+				 * Nobody is expected to touch the buffers while resizing is
+				 * going on, hence an unlocked precheck should be safe and saves
+				 * some cycles.
+				 */
+				if (!(buf_state & BM_VALID))
+					continue;
+
+				/*
+				 * XXX: Looks like CurrentResourceOwner can be NULL here, find
+				 * another one in that case?
+				 */
+				if (CurrentResourceOwner)
+					ResourceOwnerEnlarge(CurrentResourceOwner);
+
+				ReservePrivateRefCountEntry();
+
+				LockBufHdr(desc);
+
+				/*
+				 * Now that we have locked the buffer descriptor, make sure that
+				 * buffers without valid data have been skipped above.
+				 */
+				Assert(buf_state & BM_VALID);
+
+				if (!EvictUnpinnedBufferInternal(desc, &buffer_flushed))
+				{
+					elog(WARNING, "could not remove buffer %u, it is pinned", buf);
+					result = false;
+					break;
+				}
+			}
+		}
+		LWLockRelease(ShmemResizeLock);
+	}
+
+	return result;
+}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 0bfbbb096d6..db8aafdaf8c 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -630,12 +630,22 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 		strategy->current = 0;
 
 	/*
-	 * If the slot hasn't been filled yet, tell the caller to allocate a new
-	 * buffer with the normal allocation strategy.  He will then fill this
-	 * slot by calling AddBufferToRing with the new buffer.
+	 * If the slot hasn't been filled yet, or the buffer in the slot has been
+	 * invalidated when the buffer pool was shrunk, tell the caller to allocate
+	 * a new buffer with the normal allocation strategy.  He will then fill this
+	 * slot by calling AddBufferToRing with the new buffer.
+	 *
+	 * TODO: Ideally we would want to check for bufnum > NBuffers only once
+	 * after every time the buffer pool is shrunk, so as to catch any runtime
+	 * bugs that introduce invalid buffers into the ring. But that is complicated.
+	 * The BufferAccessStrategy objects are not accessible outside the
+	 * ScanState, hence we cannot purge the buffers while evicting them.
+	 * After the resizing is finished, it's not possible to notice when we touch
+	 * the first and the last of those objects. See if this can be fixed.
 	 */
 	bufnum = strategy->buffers[strategy->current];
-	if (bufnum == InvalidBuffer)
+	if (bufnum == InvalidBuffer || bufnum > NBuffers)
 		return NULL;
 
 	/*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 82cee6b8877..9a6a6275305 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -156,6 +156,7 @@ REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it c
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
 SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_EVICT	"Waiting for other backends to finish the buffer eviction phase."
 SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 51ce6ebcf6c..c91a42fc598 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -315,6 +315,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_evicted,
 									int32 *buffers_flushed,
 									int32 *buffers_skipped);
+extern bool EvictExtraBuffers(int fromBuf, int toBuf);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(int);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index eba28ce8a5c..0a59746b472 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -77,6 +77,7 @@ extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
 typedef struct
 {
 	pg_atomic_uint32 	NSharedBuffers;
+	pid_t				evictor_pid;
 	Barrier 			Barrier;
 	pg_atomic_uint64 	Generation;
 	bool                Resizable;
-- 
2.34.1

0015-Reinitialize-StrategyControl-after-resizing-20250918.patch
From d389c5f5948c4c577b480ecf7bf0de551016d447 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 19 Jun 2025 17:38:51 +0200
Subject: [PATCH 15/16] Reinitialize StrategyControl after resizing buffers

... and BgBufferSync and ClockSweepTick adjustments

Reinitializing strategy control area
====================================
The commit introduces a separate function StrategyReInitialize() instead
of reusing StrategyInitialize(), since some of the things the latter
does are not required in the former. Here's a list of what
StrategyReInitialize() does and how it differs from
StrategyInitialize().

1. The StrategyControl pointer needn't be fetched again since it should
   not change, but an Assert is added to make sure the pointer is valid.
2. &StrategyControl->buffer_strategy_lock need not be initialized again.
3. nextVictimBuffer, completePasses and numBufferAllocs are viewed in
   the context of NBuffers. Now that NBuffers itself has changed, those
   three do not make sense. Reset them as if the server had just
   restarted.

Ability to delay resizing operation
===================================
This commit introduces a flag delay_shmem_resize, which PostgreSQL
backends and workers can use to signal the coordinator to delay the
resizing operation. The background writer sets this flag while it is
scanning buffers.

Background writer operation
===========================
The background writer is blocked while the actual resizing is in
progress. It stops a scan in progress when it sees that the resizing has
begun or is about to begin. Once the buffer resizing is finished, before
resuming regular operation, bgwriter resets the information saved so
far. This information is viewed in the context of NBuffers and hence
does not make sense after resizing, which changes NBuffers.

Buffer lookup table
===================
Right now there is no way to free shared memory. Even if we shrink the
buffer lookup table when shrinking the buffer pool, the unused hash
table entries cannot be freed. When we expand the buffer pool, more
entries can be allocated, but we cannot resize the hash table directory
without rehashing all the entries. Just allocating more entries will
lead to more contention. Hence we set up the buffer lookup table only
once at the beginning, sized for the maximum possible size of the buffer
pool, which is MaxAvailableMemory.  The shared buffer lookup table and
StrategyControl are not resized even if the buffer pool is resized,
hence they are allocated in the main shared memory segment.

TODO:
====
1. The way BgBufferSync is written today, it packs four functionalities:
   setting up the buffer sync state, performing the buffer sync,
   resetting the buffer sync state when bgwriter_lru_maxpages <= 0, and
   setting it up again after bgwriter_lru_maxpages > 0. That makes the
   code hard to read.  It will be good to divide this function into 3-4
   functions, each performing one of those tasks. Then pack all the
   state (the local variables from that function converted to static
   globals) into a structure, which is passed to these functions. Once
   that happens, BgBufferSyncReset() will call one of those functions to
   reset the state when the buffer pool is resized.

2. The condition (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) ==
   NBuffers) checked in BgBufferSync() to detect whether buffer resizing
   is "about to begin" is wrong. NBuffers is not changed until
   AnonymousShmemResize() is called, and that won't be called before
   BgBufferSync() finishes if it has already begun. Need a better
   condition to check whether buffer resizing is about to begin.

Author: Ashutosh Bapat
Reviewed-by: Tomas Vondra
---
 src/backend/port/sysv_shmem.c          | 23 ++++++--
 src/backend/storage/buffer/buf_init.c  | 19 +++++--
 src/backend/storage/buffer/buf_table.c |  9 ++-
 src/backend/storage/buffer/bufmgr.c    | 72 ++++++++++++++++++------
 src/backend/storage/buffer/freelist.c  | 77 ++++++++++++++++++++++++--
 src/include/storage/buf_internals.h    |  1 +
 src/include/storage/bufmgr.h           |  1 +
 src/include/storage/ipc.h              |  1 +
 src/include/storage/pg_shmem.h         |  5 +-
 9 files changed, 170 insertions(+), 38 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 9e1b2c3201f..3be28e228ae 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -104,6 +104,7 @@ AnonymousMapping Mappings[ANON_MAPPINGS];
 
 /* Flag telling postmaster that resize is needed */
 volatile bool pending_pm_shmem_resize = false;
+volatile bool delay_shmem_resize = false;
 
 /* Keeps track of the previous NBuffers value */
 static int NBuffersOld = -1;
@@ -144,12 +145,11 @@ static int NBuffersPending = -1;
  * makes sense to evaluate them more precise.
  */
 static double SHMEM_RESIZE_RATIO[6] = {
-	0.1,    /* MAIN_SHMEM_SEGMENT */
+	0.15,    /* MAIN_SHMEM_SEGMENT */
 	0.6,    /* BUFFERS_SHMEM_SEGMENT */
 	0.1,    /* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
 	0.1,    /* BUFFER_IOCV_SHMEM_SEGMENT */
 	0.05,   /* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
-	0.05,   /* STRATEGY_SHMEM_SEGMENT */
 };
 
 /*
@@ -225,8 +225,6 @@ MappingName(int shmem_segment)
 			return "iocv";
 		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
 			return "checkpoint";
-		case STRATEGY_SHMEM_SEGMENT:
-			return "strategy";
 		default:
 			return "unknown";
 	}
@@ -1125,13 +1123,17 @@ ProcessBarrierShmemResize(Barrier *barrier)
 {
 	Assert(IsUnderPostmaster);
 
-	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
-		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize, delay_shmem_resize);
 
 	/* Wait until we have seen the new NBuffers value */
 	if (!pending_pm_shmem_resize)
 		return false;
 
+	/* Wait till this process becomes ready to resize buffers. */
+	if (delay_shmem_resize)
+		return false;
+
 	/*
 	 * First thing to do after attaching to the barrier is to wait for others.
 	 * We can't simply use BarrierArriveAndWait, because backends might arrive
@@ -1182,6 +1184,15 @@ ProcessBarrierShmemResize(Barrier *barrier)
 	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
 	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
 
+	if (MyBackendType == B_BG_WRITER)
+	{
+		/*
+		 * Before resuming regular background writer activity, adjust the
+		 * statistics collected so far.
+		 */
+		BgBufferSyncReset(NBuffersOld, NBuffers);
+	}
+
 	BarrierDetach(barrier);
 	return true;
 }
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 0e72e373193..be64fa5a136 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -152,8 +152,15 @@ BufferManagerShmemInit(int FirstBufferToInit)
 	}
 #endif
 
-	/* Init other shared buffer-management stuff */
-	StrategyInitialize(!foundDescs);
+	/*
+	 * Init other shared buffer-management stuff from scratch when configuring
+	 * the buffer pool for the first time. If we are just resizing the buffer
+	 * pool, adjust only the required structures.
+	 */
+	if (FirstBufferToInit == 0)
+		StrategyInitialize(!foundDescs);
+	else
+		StrategyReInitialize(FirstBufferToInit);
 
 	/* Initialize per-backend file flush context */
 	WritebackContextInit(&BackendWritebackContext,
@@ -184,9 +191,6 @@ BufferManagerShmemSize(void)
 	size = add_size(size, mul_size(NBuffers, BLCKSZ));
 	Mappings[BUFFERS_SHMEM_SEGMENT].shmem_req_size = size;
 
-	/* size of stuff controlled by freelist.c */
-	Mappings[STRATEGY_SHMEM_SEGMENT].shmem_req_size = StrategyShmemSize();
-
 	/* size of I/O condition variables, plus alignment padding */
 	size = add_size(0, mul_size(NBuffers,
 								   sizeof(ConditionVariableMinimallyPadded)));
@@ -196,5 +200,10 @@ BufferManagerShmemSize(void)
 	/* size of checkpoint sort array in bufmgr.c */
 	Mappings[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_req_size = mul_size(NBuffers, sizeof(CkptSortItem));
 
+	/* Allocations in the main memory segment, at the end. */
+
+	/* size of stuff controlled by freelist.c */
+	size = add_size(0, StrategyShmemSize());
+
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 18a78967138..e5a97e557d9 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -65,11 +65,18 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
+	/*
+	 * The shared buffer lookup table is set up only once, with the maximum
+	 * possible number of entries given the maximum size of the buffer pool. It
+	 * is not resized after that even if the buffer pool is resized. Hence it is
+	 * allocated in the main shared memory segment and not in a resizable shared
+	 * memory segment.
+	 */
 	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
 								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION | HASH_FIXED_SIZE,
-								  STRATEGY_SHMEM_SEGMENT);
+								  MAIN_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5424c405b44..48c46d5b963 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3580,6 +3580,32 @@ BufferSync(int flags)
 	TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
 }
 
+/*
+ * Information saved between BgBufferSync() calls so we can determine the
+ * strategy point's advance rate and avoid scanning already-cleaned buffers. The
+ * variables are static at file scope instead of local to BgBufferSync() so
+ * that BgBufferSyncReset() can reset them when shared buffers are resized.
+ */
+static bool saved_info_valid = false;
+static int	prev_strategy_buf_id;
+static uint32 prev_strategy_passes;
+static int	next_to_clean;
+static uint32 next_passes;
+
+/* Moving averages of allocation rate and clean-buffer density */
+static float smoothed_alloc = 0;
+static float smoothed_density = 10.0;
+
+void
+BgBufferSyncReset(int NBuffersOld, int NBuffersNew)
+{
+	saved_info_valid = false;
+#ifdef BGW_DEBUG
+	elog(DEBUG2, "invalidated background writer status after resizing buffers from %d to %d",
+		 NBuffersOld, NBuffersNew);
+#endif
+}
+
 /*
  * BgBufferSync -- Write out some dirty buffers in the pool.
  *
@@ -3599,20 +3625,6 @@ BgBufferSync(WritebackContext *wb_context)
 	uint32		strategy_passes;
 	uint32		recent_alloc;
 
-	/*
-	 * Information saved between calls so we can determine the strategy
-	 * point's advance rate and avoid scanning already-cleaned buffers.
-	 */
-	static bool saved_info_valid = false;
-	static int	prev_strategy_buf_id;
-	static uint32 prev_strategy_passes;
-	static int	next_to_clean;
-	static uint32 next_passes;
-
-	/* Moving averages of allocation rate and clean-buffer density */
-	static float smoothed_alloc = 0;
-	static float smoothed_density = 10.0;
-
 	/* Potentially these could be tunables, but for now, not */
 	float		smoothing_samples = 16;
 	float		scan_whole_pool_milliseconds = 120000.0;
@@ -3635,6 +3647,22 @@ BgBufferSync(WritebackContext *wb_context)
 	long		new_strategy_delta;
 	uint32		new_recent_alloc;
 
+	/*
+	 * If the buffer pool is being shrunk, the buffer being written out may not
+	 * remain valid. If the buffer pool is being expanded, more buffers will
+	 * become available without this function writing out any. Hence wait till
+	 * buffer resizing finishes, i.e. go into hibernation mode.
+	 */
+	if (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+		return true;
+
+	/*
+	 * Resizing shared buffers while this function is performing an LRU scan on
+	 * them may lead to wrong results. Indicate that the resizing should wait for
+	 * the LRU scan to complete.
+	 */
+	delay_shmem_resize = true;
+
 	/*
 	 * Find out where the clock-sweep currently is, and how many buffer
 	 * allocations have happened since our last call.
@@ -3811,8 +3839,17 @@ BgBufferSync(WritebackContext *wb_context)
 	num_written = 0;
 	reusable_buffers = reusable_buffers_est;
 
-	/* Execute the LRU scan */
-	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
+	/*
+	 * Execute the LRU scan.
+	 *
+	 * If the buffer pool is being shrunk, the buffer being written may not
+	 * remain valid. If the buffer pool is being expanded, more buffers will
+	 * become available without this function writing any. Hence stop what we
+	 * are doing; this also unblocks other processes that are waiting for
+	 * buffer resizing to finish.
+	 */
+	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est &&
+			pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) == NBuffers)
 	{
 		int			sync_state = SyncOneBuffer(next_to_clean, true,
 											   wb_context);
@@ -3871,6 +3908,9 @@ BgBufferSync(WritebackContext *wb_context)
 #endif
 	}
 
+	/* Let the resizing commence. */
+	delay_shmem_resize = false;
+
 	/* Return true if OK to hibernate */
 	return (bufs_to_lap == 0 && recent_alloc == 0);
 }
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index db8aafdaf8c..89269087034 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -371,12 +371,21 @@ StrategyInitialize(bool init)
 	 *
 	 * Since we can't tolerate running out of lookup table entries, we must be
 	 * sure to specify an adequate table size here.  The maximum steady-state
-	 * usage is of course NBuffers entries, but BufferAlloc() tries to insert
-	 * a new entry before deleting the old.  In principle this could be
-	 * happening in each partition concurrently, so we could need as many as
-	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
+	 * usage is of course as many entries as there are buffers in the buffer
+	 * pool.  Right now there is no way to free shared memory. Even if we shrink
+	 * the buffer lookup table when shrinking the buffer pool, the unused hash
+	 * table entries cannot be freed. When we expand the buffer pool, more
+	 * entries can be allocated, but we cannot resize the hash table directory
+	 * without rehashing all the entries. Just allocating more entries will lead
+	 * to more contention. Hence we set up the buffer lookup table for the
+	 * maximum possible size of the buffer pool, which is MaxAvailableMemory.
+	 *
+	 * Additionally, BufferAlloc() tries to insert a new entry before deleting
+	 * the old.  In principle this could be happening in each partition
+	 * concurrently, so we need NUM_BUFFER_PARTITIONS extra entries.
 	 */
-	InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
+	InitBufTable(MaxAvailableMemory + NUM_BUFFER_PARTITIONS);
 
 	/*
 	 * Get or create the shared strategy control block
@@ -384,7 +393,7 @@ StrategyInitialize(bool init)
 	StrategyControl = (BufferStrategyControl *)
 		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found, STRATEGY_SHMEM_SEGMENT);
+						&found, MAIN_SHMEM_SEGMENT);
 
 	if (!found)
 	{
@@ -409,6 +418,62 @@ StrategyInitialize(bool init)
 		Assert(!init);
 }
 
+/*
+ * StrategyReInitialize -- re-initialize the buffer cache replacement
+ *		strategy.
+ *
+ * To be called when resizing buffer manager and only from the coordinator.
+ * TODO: Assess the differences between this function and StrategyInitialize().
+ */
+void
+StrategyReInitialize(int FirstBufferIdToInit)
+{
+	bool		found;
+
+	/*
+	 * Resizing memory for buffer pools should not affect the address of
+	 * StrategyControl.
+	 */
+	if (StrategyControl != (BufferStrategyControl *)
+		ShmemInitStructInSegment("Buffer Strategy Status",
+						sizeof(BufferStrategyControl),
+						&found, MAIN_SHMEM_SEGMENT))
+		elog(FATAL, "something went wrong while re-initializing the buffer strategy");
+
+	Assert(found);
+
+	/*
+	 * TODO: Buffer lookup table adjustment. There are two options:
+	 *
+	 * 1. Resize the buffer lookup table to match the new number of buffers. But
+	 * this requires rehashing all the entries in the buffer lookup table with
+	 * the new table size.
+	 *
+	 * 2. Allocate the maximum size of the buffer lookup table at the beginning
+	 * and never resize it. This leaves a sparse buffer lookup table, which is
+	 * inefficient from both a memory and a time perspective. According to David
+	 * Rowley, the sparse entries in the buffer lookup table cause frequent
+	 * cacheline reloads, which affects performance. If the impact of that
+	 * inefficiency in a benchmark is significant, we will need to consider the
+	 * first option.
+	 */
+	/*
+	 * The clock sweep tick pointer may have been invalidated. Reset it as if
+	 * starting a fresh server.
+	 */
+	pg_atomic_write_u32(&StrategyControl->nextVictimBuffer, 0);
+
+	/*
+	 * The old statistics are viewed in the context of the number of shared
+	 * buffers. They do not make sense now that the number of shared buffers
+	 * itself has changed.
+	 */
+	StrategyControl->completePasses = 0;
+	pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
+
+	/* No pending notification */
+	StrategyControl->bgwprocno = -1;
+}
+
 
 /* ----------------------------------------------------------------
  *				Backend-private buffer ring management
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index dfd614f7ca4..551479649ca 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -443,6 +443,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
+extern void StrategyReInitialize(int FirstBufferToInit);
 
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c91a42fc598..2fe3202168b 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -299,6 +299,7 @@ extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(WritebackContext *wb_context);
+extern void BgBufferSyncReset(int NBuffersOld, int NBuffersNew);
 
 extern uint32 GetPinLimit(void);
 extern uint32 GetLocalPinLimit(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 847f56a36dc..6e7b0abb625 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -65,6 +65,7 @@ typedef void (*shmem_startup_hook_type) (void);
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
 extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
+extern PGDLLIMPORT volatile bool delay_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 0a59746b472..704b065f9e9 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -65,7 +65,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 6
+#define ANON_MAPPINGS 5
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -172,7 +172,4 @@ extern void ShmemControlInit(void);
 /* Checkpoint BufferIds */
 #define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
 
-/* Buffer strategy status */
-#define STRATEGY_SHMEM_SEGMENT 5
-
 #endif							/* PG_SHMEM_H */
-- 
2.34.1

0016-Tests-for-dynamic-shared_buffers-resizing-20250918.patch
From 359800f4e40a4ac97cb5aec1198bd0b8584ce53e Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Wed, 3 Sep 2025 10:59:20 +0530
Subject: [PATCH 16/16] Tests for dynamic shared_buffers resizing

The commit adds two tests:

1. A TAP test to stress test buffer pool resizing under concurrent load.

2. A SQL test to check the sanity of shared memory allocations and mappings
   after a buffer pool resizing operation.

Author: Palak Chaturvedi <chaturvedipalak1911@gmail.com>
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
---
 src/test/Makefile                             |   2 +-
 src/test/README                               |   3 +
 src/test/buffermgr/Makefile                   |  27 ++
 src/test/buffermgr/README                     |  26 ++
 src/test/buffermgr/expected/buffer_resize.out | 237 ++++++++++++++++++
 src/test/buffermgr/meson.build                |  17 ++
 src/test/buffermgr/sql/buffer_resize.sql      |  73 ++++++
 src/test/buffermgr/t/001_resize_buffer.pl     | 126 ++++++++++
 src/test/meson.build                          |   1 +
 9 files changed, 511 insertions(+), 1 deletion(-)
 create mode 100644 src/test/buffermgr/Makefile
 create mode 100644 src/test/buffermgr/README
 create mode 100644 src/test/buffermgr/expected/buffer_resize.out
 create mode 100644 src/test/buffermgr/meson.build
 create mode 100644 src/test/buffermgr/sql/buffer_resize.sql
 create mode 100644 src/test/buffermgr/t/001_resize_buffer.pl

diff --git a/src/test/Makefile b/src/test/Makefile
index 511a72e6238..95f8858a818 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,7 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl postmaster regress isolation modules authentication recovery subscription
+SUBDIRS = perl postmaster regress isolation modules authentication recovery subscription buffermgr
 
 ifeq ($(with_icu),yes)
 SUBDIRS += icu
diff --git a/src/test/README b/src/test/README
index afdc7676519..77f11607ff7 100644
--- a/src/test/README
+++ b/src/test/README
@@ -15,6 +15,9 @@ examples/
   Demonstration programs for libpq that double as regression tests via
   "make check"
 
+buffermgr/
+  Tests for resizing buffer pool without restarting the server
+
 isolation/
   Tests for concurrent behavior at the SQL level
 
diff --git a/src/test/buffermgr/Makefile b/src/test/buffermgr/Makefile
new file mode 100644
index 00000000000..97c3da9e20a
--- /dev/null
+++ b/src/test/buffermgr/Makefile
@@ -0,0 +1,27 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/buffermgr
+#
+# Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/buffermgr/Makefile
+#
+#-------------------------------------------------------------------------
+
+EXTRA_INSTALL = contrib/pg_buffercache
+
+REGRESS = buffer_resize
+
+subdir = src/test/buffermgr
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean:
+	rm -rf tmp_check
diff --git a/src/test/buffermgr/README b/src/test/buffermgr/README
new file mode 100644
index 00000000000..c375ad80989
--- /dev/null
+++ b/src/test/buffermgr/README
@@ -0,0 +1,26 @@
+src/test/buffermgr/README
+
+Regression tests for buffer manager
+===================================
+
+This directory contains a test suite for resizing the buffer pool without restarting the server.
+
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
+
+See src/test/perl/README for more info about running these tests.
diff --git a/src/test/buffermgr/expected/buffer_resize.out b/src/test/buffermgr/expected/buffer_resize.out
new file mode 100644
index 00000000000..a986be9a5da
--- /dev/null
+++ b/src/test/buffermgr/expected/buffer_resize.out
@@ -0,0 +1,237 @@
+-- Test buffer pool resizing and shared memory allocation tracking
+-- This test resizes the buffer pool multiple times and monitors
+-- shared memory allocations related to buffer management
+-- Create a separate schema for this test
+CREATE SCHEMA buffer_resize_test;
+SET search_path TO buffer_resize_test, public;
+-- Create a view for buffer-related shared memory allocations
+CREATE VIEW buffer_allocations AS
+SELECT name, segment, size, allocated_size 
+FROM pg_shmem_allocations 
+WHERE name IN ('Buffer Blocks', 'Buffer Descriptors', 'Buffer IO Condition Variables', 
+               'Checkpoint BufferIds')
+ORDER BY name;
+-- Note: We exclude the 'main' segment even if it contains the shared buffer
+-- lookup table because it contains other shared structures whose total sizes
+-- may vary as the code changes.
+CREATE VIEW buffer_segments AS
+SELECT name, size, mapping_size, mapping_reserved_size
+FROM pg_shmem_segments
+WHERE name <> 'main'
+ORDER BY name;
+-- Enable pg_buffercache for buffer count verification
+CREATE EXTENSION IF NOT EXISTS pg_buffercache;
+-- Test 1: Default shared_buffers 
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 128MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 134221824 |      134221824
+ Buffer Descriptors            | descriptors |   1048576 |        1048576
+ Buffer IO Condition Variables | iocv        |    262144 |         262144
+ Checkpoint BufferIds          | checkpoint  |    327680 |         327680
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 134225920 |    134225920 |            2576982016
+ checkpoint  |    335872 |       335872 |             214753280
+ descriptors |   1056768 |      1056768 |             429498368
+ iocv        |    270336 |       270336 |             429498368
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        16384
+(1 row)
+
+-- Test 2: Set to 64MB  
+ALTER SYSTEM SET shared_buffers = '64MB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+SELECT pg_sleep(1);
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 64MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size   | allocated_size 
+-------------------------------+-------------+----------+----------------
+ Buffer Blocks                 | buffers     | 67112960 |       67112960
+ Buffer Descriptors            | descriptors |   524288 |         524288
+ Buffer IO Condition Variables | iocv        |   131072 |         131072
+ Checkpoint BufferIds          | checkpoint  |   163840 |         163840
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size   | mapping_size | mapping_reserved_size 
+-------------+----------+--------------+-----------------------
+ buffers     | 67117056 |     67117056 |            2576982016
+ checkpoint  |   172032 |       172032 |             214753280
+ descriptors |   532480 |       532480 |             429498368
+ iocv        |   139264 |       139264 |             429498368
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+         8192
+(1 row)
+
+-- Test 3: Set to 256MB
+ALTER SYSTEM SET shared_buffers = '256MB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+SELECT pg_sleep(1);
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 256MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 268439552 |      268439552
+ Buffer Descriptors            | descriptors |   2097152 |        2097152
+ Buffer IO Condition Variables | iocv        |    524288 |         524288
+ Checkpoint BufferIds          | checkpoint  |    655360 |         655360
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 268443648 |    268443648 |            2576982016
+ checkpoint  |    663552 |       663552 |             214753280
+ descriptors |   2105344 |      2105344 |             429498368
+ iocv        |    532480 |       532480 |             429498368
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        32768
+(1 row)
+
+-- Test 4: Set to 100MB (non-power-of-two)
+ALTER SYSTEM SET shared_buffers = '100MB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+SELECT pg_sleep(1);
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 100MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 104861696 |      104861696
+ Buffer Descriptors            | descriptors |    819200 |         819200
+ Buffer IO Condition Variables | iocv        |    204800 |         204800
+ Checkpoint BufferIds          | checkpoint  |    256000 |         256000
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 104865792 |    104865792 |            2576982016
+ checkpoint  |    262144 |       262144 |             214753280
+ descriptors |    827392 |       827392 |             429498368
+ iocv        |    212992 |       212992 |             429498368
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        12800
+(1 row)
+
+-- Test 5: Set to minimum 128kB
+ALTER SYSTEM SET shared_buffers = '128kB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+SELECT pg_sleep(1);
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 128kB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |  size  | allocated_size 
+-------------------------------+-------------+--------+----------------
+ Buffer Blocks                 | buffers     | 135168 |         135168
+ Buffer Descriptors            | descriptors |   1024 |           1024
+ Buffer IO Condition Variables | iocv        |    256 |            256
+ Checkpoint BufferIds          | checkpoint  |    320 |            320
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |  size  | mapping_size | mapping_reserved_size 
+-------------+--------+--------------+-----------------------
+ buffers     | 139264 |       139264 |            2576982016
+ checkpoint  |   8192 |         8192 |             214753280
+ descriptors |   8192 |         8192 |             429498368
+ iocv        |   8192 |         8192 |             429498368
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+           16
+(1 row)
+
+-- Clean up the schema and all its objects
+RESET search_path;
+DROP SCHEMA buffer_resize_test CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to view buffer_resize_test.buffer_allocations
+drop cascades to view buffer_resize_test.buffer_segments
+drop cascades to extension pg_buffercache
diff --git a/src/test/buffermgr/meson.build b/src/test/buffermgr/meson.build
new file mode 100644
index 00000000000..e71dcdea685
--- /dev/null
+++ b/src/test/buffermgr/meson.build
@@ -0,0 +1,17 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'buffermgr',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'buffer_resize',
+    ],
+  },
+  'tap': {
+    'tests': [
+      't/001_resize_buffer.pl',
+    ],
+  },
+}
diff --git a/src/test/buffermgr/sql/buffer_resize.sql b/src/test/buffermgr/sql/buffer_resize.sql
new file mode 100644
index 00000000000..45f5bb6d78b
--- /dev/null
+++ b/src/test/buffermgr/sql/buffer_resize.sql
@@ -0,0 +1,73 @@
+-- Test buffer pool resizing and shared memory allocation tracking
+-- This test resizes the buffer pool multiple times and monitors
+-- shared memory allocations related to buffer management
+
+-- Create a separate schema for this test
+CREATE SCHEMA buffer_resize_test;
+SET search_path TO buffer_resize_test, public;
+
+-- Create a view for buffer-related shared memory allocations
+CREATE VIEW buffer_allocations AS
+SELECT name, segment, size, allocated_size 
+FROM pg_shmem_allocations 
+WHERE name IN ('Buffer Blocks', 'Buffer Descriptors', 'Buffer IO Condition Variables', 
+               'Checkpoint BufferIds')
+ORDER BY name;
+
+-- Note: We exclude the 'main' segment even if it contains the shared buffer
+-- lookup table because it contains other shared structures whose total sizes
+-- may vary as the code changes.
+CREATE VIEW buffer_segments AS
+SELECT name, size, mapping_size, mapping_reserved_size
+FROM pg_shmem_segments
+WHERE name <> 'main'
+ORDER BY name;
+
+-- Enable pg_buffercache for buffer count verification
+CREATE EXTENSION IF NOT EXISTS pg_buffercache;
+
+-- Test 1: Default shared_buffers 
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 2: Set to 64MB  
+ALTER SYSTEM SET shared_buffers = '64MB';
+SELECT pg_reload_conf();
+SELECT pg_sleep(1);
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 3: Set to 256MB
+ALTER SYSTEM SET shared_buffers = '256MB';
+SELECT pg_reload_conf();
+SELECT pg_sleep(1);
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 4: Set to 100MB (non-power-of-two)
+ALTER SYSTEM SET shared_buffers = '100MB';
+SELECT pg_reload_conf();
+SELECT pg_sleep(1);
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 5: Set to minimum 128kB
+ALTER SYSTEM SET shared_buffers = '128kB';
+SELECT pg_reload_conf();
+SELECT pg_sleep(1);
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Clean up the schema and all its objects
+RESET search_path;
+DROP SCHEMA buffer_resize_test CASCADE;
diff --git a/src/test/buffermgr/t/001_resize_buffer.pl b/src/test/buffermgr/t/001_resize_buffer.pl
new file mode 100644
index 00000000000..8cf9e4539ab
--- /dev/null
+++ b/src/test/buffermgr/t/001_resize_buffer.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2025-2025, PostgreSQL Global Development Group
+#
+# Minimal test testing shared_buffer resizing under load
+
+use strict;
+use warnings;
+use IPC::Run;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Function to resize buffer pool and verify the change.
+sub apply_and_verify_buffer_change
+{
+	my ($node, $new_size) = @_;
+	
+	# Use a single background_psql session for consistency
+	my $psql_session = $node->background_psql('postgres');
+	$psql_session->query_safe("ALTER SYSTEM SET shared_buffers = '$new_size'");
+	$psql_session->query_safe("SELECT pg_reload_conf()");
+	
+	# Wait till the resizing finishes using the same session
+	# 
+	# TODO: Right now there is no way to know when the resize has finished and
+	# all the backends are using new value of shared_buffers. Hence we poll
+	# manually until we get the expected value in the same session.
+	my $current_size;
+	my $attempts = 0;
+	my $max_attempts = 60; # 60 seconds timeout
+	do {
+		$current_size = $psql_session->query_safe("SHOW shared_buffers");
+		$attempts++;
+		
+		# Only sleep if we didn't get the expected result and haven't timed out yet
+		if ($current_size ne $new_size && $attempts < $max_attempts) {
+			sleep(1);
+		}
+	} while ($current_size ne $new_size && $attempts < $max_attempts);
+	
+	$psql_session->quit;
+	
+	# Check if we succeeded or timed out
+	if ($current_size ne $new_size) {
+		die "Timeout waiting for shared_buffers to change to $new_size (got $current_size after ${attempts}s)";
+	}
+}
+
+# Initialize a cluster and start pgbench in the background for concurrent load.
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+$node->safe_psql('postgres', "CREATE EXTENSION pg_buffercache");
+my $pgb_scale = 10;
+my $pgb_duration = 120;
+my $pgb_num_clients = 10;
+$node->pgbench(
+	"--initialize --init-steps=dtpvg --scale=$pgb_scale --quiet",
+	0,
+	[qr{^$}],
+	[   # stderr patterns to verify initialization stages
+		qr{dropping old tables},
+		qr{creating tables},
+		qr{done in \d+\.\d\d s }
+	],
+	"pgbench initialization (scale=$pgb_scale)"
+);
+my ($pgbench_stdin, $pgbench_stdout, $pgbench_stderr) = ('', '', '');
+my $pgbench_process = IPC::Run::start(
+	[
+		'pgbench',
+		'-p', $node->port,
+		'-T', $pgb_duration,
+		'-c', $pgb_num_clients,
+		'postgres'
+	],
+	'<'  => \$pgbench_stdin,
+	'>'  => \$pgbench_stdout,
+	'2>' => \$pgbench_stderr
+);
+
+ok($pgbench_process, "pgbench started successfully");
+
+# Allow pgbench to establish connections and start generating load.
+# 
+# TODO: When creating new backends is known to work well with buffer pool
+# resizing, this wait should be removed.
+sleep(1);
+
+# Resize buffer pool to various sizes while pgbench is running in the
+# background.
+# 
+# TODO: These are pseudo-randomly picked sizes, but we can do better.
+my $tests_completed = 0;
+my @buffer_sizes = ('900MB', '500MB', '250MB', '400MB', '120MB', '600MB');
+for my $target_size (@buffer_sizes)
+{
+	# Verify workload generator is still running
+	if (!$pgbench_process->pumpable) {
+		ok(0, "pgbench is still running");
+		last;
+	}
+	
+	apply_and_verify_buffer_change($node, $target_size);
+	$tests_completed++;
+	
+	# Wait for the resized buffer pool to stabilize. If the resized buffer pool
+	# is utilized fully, the workload should hit any wrongly initialized areas
+	# of shared memory.
+	sleep(2);
+}
+is($tests_completed, scalar(@buffer_sizes), "All buffer sizes were tested");
+
+# Make sure that pgbench can end normally.
+$pgbench_process->signal('TERM');
+IPC::Run::finish $pgbench_process;
+ok(grep { $pgbench_process->result == $_ } (0, 15),  "pgbench exited gracefully");
+
+# Log any error output from pgbench for debugging
+diag("pgbench stderr:\n$pgbench_stderr");
+diag("pgbench stdout:\n$pgbench_stdout");
+
+# Ensure database is still functional after all the buffer changes
+$node->connect_ok("dbname=postgres", 
+	"Database remains accessible after $tests_completed buffer resize operations");
+
+done_testing();
\ No newline at end of file
diff --git a/src/test/meson.build b/src/test/meson.build
index ccc31d6a86a..2a5ba1dec39 100644
--- a/src/test/meson.build
+++ b/src/test/meson.build
@@ -4,6 +4,7 @@ subdir('regress')
 subdir('isolation')
 
 subdir('authentication')
+subdir('buffermgr')
 subdir('postmaster')
 subdir('recovery')
 subdir('subscription')
-- 
2.34.1

#117Andres Freund
andres@anarazel.de
In reply to: Ashutosh Bapat (#116)
Re: Changing shared_buffers without restart

Hi,

On 2025-09-18 10:25:29 +0530, Ashutosh Bapat wrote:

From d1ed934ccd02fca2c831e582b07a169e17d19f59 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 15:14:33 +0200
Subject: [PATCH 02/16] Process config reload in AIO workers

I think this is superfluous due to b8e1f2d96bb9

Currently AIO workers process interrupts only via CHECK_FOR_INTERRUPTS,
which does not include ConfigReloadPending. Thus we need to check for it
explicitly.

+/*
+ * Process any new interrupts.
+ */
+static void
+pgaio_worker_process_interrupts(void)
+{
+	/*
+	 * Reloading config can trigger further signals, complicating interrupts
+	 * processing -- so let it run first.
+	 *
+	 * XXX: Is there any need in memory barrier after ProcessConfigFile?
+	 */
+	if (ConfigReloadPending)
+	{
+		ConfigReloadPending = false;
+		ProcessConfigFile(PGC_SIGHUP);
+	}
+
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+}

Given that even before b8e1f2d96bb9 method_worker.c used
CHECK_FOR_INTERRUPTS(), which contains a ProcessProcSignalBarrier(), I don't
know why that second check was added here?

From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinating work
between backends to change, meaning we cannot apply it right away.

Add a new flag "pending" for an assign hook to allow the hook to indicate
exactly that. If the pending flag is set after the hook, the new value
will not be applied and its handling becomes the hook's implementation
responsibility.

I doubt it makes sense to add this to the GUC system. I think it'd be better
to just use the GUC value as the desired "target" configuration and have a
function or a show-only GUC for reporting the current size.

I don't think you can just block application of the GUC until the resize is
complete. E.g. what if the value was too big and the new configuration needs
to be fixed to be lower?

From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier allows to make sure the message sent
via EmitProcSignalBarrier was processed by all ProcSignal mechanism
participants.

Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
which will be updated when a process has received the message, but not
processed it yet. This makes it possible to support a new mode of
waiting, when ProcSignal participants want to synchronize message
processing. To do that, a participant can wait via
WaitForProcSignalBarrierReceived when processing a message, effectively
making sure that all processes are going to start processing
ProcSignalBarrier simultaneously.

I doubt "online resizing" that requires synchronously processing the same
event, can really be called "online". There can be significant delays in
processing a barrier, stalling the entire server until that is reached seems
like a complete no-go for production systems?

From 63fe27340656c52b13f4eecebd9e73d24efe5e33 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 28 Feb 2025 19:54:47 +0100
Subject: [PATCH 05/16] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits ways how the shared memory could be organized.

Introduce the possibility to allocate multiple shared memory mappings, where
a single mapping is associated with a specified shared memory segment.
There is only a fixed number of available segments; currently only one
main shared memory segment is allocated. A new shared memory API is
introduced, extended with a segment as a new parameter. As a path of
least resistance, the original API is kept in place, utilizing the main
shared memory segment.

-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40

Why does a patch like this contain changes like this mixed in with the rest?
That's clearly not directly related to $subject.

/* shared memory global variables */

-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
+ShmemSegment Segments[ANON_MAPPINGS];

-static void *ShmemBase; /* start address of shared memory */
-
-static void *ShmemEnd; /* end+1 address of shared memory */
-
-slock_t *ShmemLock; /* spinlock for shared memory and LWLock
- * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */

Why do we need a separate ShmemLock for each segment? Besides being
unnecessary, it seems like that prevents locking in a way that provides
consistency across all segments.

From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:47:04 +0200
Subject: [PATCH 06/16] Address space reservation for shared memory

Currently the shared memory layout is designed to pack everything tight
together, leaving no space between mappings for resizing. Here is how it
looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
anonymous shared memory we talk about:

00400000-00490000 /path/bin/postgres
...
012d9000-0133e000 [heap]
7f443a800000-7f470a800000 /dev/zero (deleted)
7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
...

Make the layout more dynamic via splitting every shared memory segment
into two parts:

* An anonymous file, which actually contains the shared memory content. Such
an anonymous file is created via memfd_create; it lives in memory,
behaves like a regular file and is semantically equivalent to
anonymous memory allocated via mmap with MAP_ANONYMOUS.

* A reservation mapping, whose size is much larger than the required shared
segment size. This mapping is created with flags PROT_NONE (which
makes sure the reserved space is not used), and MAP_NORESERVE (to not
count the reserved space against memory limits). The anonymous file is
mapped into this reservation mapping.
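
For illustration, a minimal sketch of that layout (hypothetical names and
sizes, error handling omitted; not the patch's actual code):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/*
 * Sketch only: create one resizable segment as described above.  The
 * function name, the file name and the sizes are made up for illustration.
 */
static char *
create_segment(size_t reserved_size, size_t initial_size, int *fd_out)
{
    /* in-memory file holding the actual shared memory contents */
    int      fd = memfd_create("pg_shmem_segment", 0);
    char    *base;

    ftruncate(fd, (off_t) initial_size);

    /* reserve a large address range: inaccessible and not accounted */
    base = mmap(NULL, reserved_size, PROT_NONE,
                MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE, -1, 0);

    /* back the currently used prefix of the reservation with the file */
    mmap(base, initial_size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);

    *fd_out = fd;
    return base;
}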

The commit message fails to explain why, if we're already relying on
MAP_NORESERVE, we need to do anything else? Why can't we just have one maximally
sized allocation that's marked MAP_NORESERVE for all the parts that we don't
yet need?

There are also a few unrelated advantages of using anon files:

* We've got a file descriptor, which could be used for regular file
operations (modification, truncation, you name it).

What is this an advantage for?

* The file could be given a name, which improves readability when it
comes to process maps.

* By default, Linux will not add file-backed shared mappings into a core dump,
making it more convenient to work with them in PostgreSQL: no more huge dumps
to process.

That's just as well a downside, because you now can't investigate some
issues. This was already configurable via coredump_filter.

From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:22:02 +0200
Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

Why do all these need to be separate segments? Afaict we'll have to maximally
size everything other than BUFFERS_SHMEM_SEGMENT at start?

From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 14:16:55 +0200
Subject: [PATCH 11/16] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using space,
introduced in the previous commits without requiring PostgreSQL restart.
Essentially the implementation is based on two mechanisms: a
ProcSignalBarrier is used to make sure all processes are starting the
resize procedure simultaneously, and a global Barrier is used to
coordinate after that and make sure all finished processes are waiting
for others that are in progress.

The resize process looks like this:

* The GUC assign hook sets a flag to let the Postmaster know that resize
was requested.

* Postmaster verifies the flag in the event loop, and starts the resize
by emitting a ProcSignal barrier.

* All processes, that participate in ProcSignal mechanism, begin to
process ProcSignal barrier. First a process waits until all processes
have confirmed they received the message and can start simultaneously.

As mentioned above, this basically makes the entire feature not really
online. Besides the latency of some processes not getting to the barrier
immediately, there's also the issue that actually reserving large amounts of
memory can take a long time - during which all processes would be unavailable.

I really don't see that being viable. It'd be one thing if that were a
"temporary" restriction, but the whole design seems to be fairly centered
around that.

* Every process recalculates shared memory size based on the new
NBuffers, adjusts its size using ftruncate and adjust reservation
permissions with mprotect. One elected process signals the postmaster
to do the same.
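
For illustration, a sketch of what that per-process step could look like for
the layout sketched earlier (hypothetical helper, not the patch's code; this
version remaps the affected range with MAP_FIXED rather than adjusting
permissions with mprotect as the commit message describes; sizes assumed
page-aligned, error handling omitted):

static void
resize_segment(char *base, int fd, size_t old_size, size_t new_size)
{
    if (new_size > old_size)
    {
        /* enlarge the backing file, then expose the additional range */
        ftruncate(fd, (off_t) new_size);
        mmap(base + old_size, new_size - old_size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, (off_t) old_size);
    }
    else
    {
        /* re-reserve the tail so it can no longer be accessed ... */
        mmap(base + new_size, old_size - new_size, PROT_NONE,
             MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE | MAP_FIXED, -1, 0);
        /* ... truncating the file is what returns the pages to the OS */
        ftruncate(fd, (off_t) new_size);
    }
}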

If we just used a single memory mapping with all unused parts marked
MAP_NORESERVE, we wouldn't need this (and wouldn't need a fair bit of other
work in this patchset)..

From experiment it turns out that shared mappings have to be extended
separately for each process that uses them. Another rough edge is that a
backend blocked on ReadCommand will not apply shared_buffers change
until it receives something.

That's not a rough edge, that basically makes the feature unusable, no?

+-- Test 2: Set to 64MB  
+ALTER SYSTEM SET shared_buffers = '64MB';
+SELECT pg_reload_conf();
+SELECT pg_sleep(1);
+SHOW shared_buffers;

Tests containing sleeps are a significant warning flag imo.

Greetings,

Andres Freund

#118Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#117)
Re: Changing shared_buffers without restart

Hi,

On 2025-09-18 09:52:03 -0400, Andres Freund wrote:

On 2025-09-18 10:25:29 +0530, Ashutosh Bapat wrote:

From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier allows to make sure the message sent
via EmitProcSignalBarrier was processed by all ProcSignal mechanism
participants.

Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
which will be updated when a process has received the message, but not
processed it yet. This makes it possible to support a new mode of
waiting, when ProcSignal participants want to synchronize message
processing. To do that, a participant can wait via
WaitForProcSignalBarrierReceived when processing a message, effectively
making sure that all processes are going to start processing
ProcSignalBarrier simultaneously.

I doubt "online resizing" that requires synchronously processing the same
event, can really be called "online". There can be significant delays in
processing a barrier, stalling the entire server until that is reached seems
like a complete no-go for production systems?

[...]

From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 14:16:55 +0200
Subject: [PATCH 11/16] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using space,
introduced in the previous commits without requiring PostgreSQL restart.
Essentially the implementation is based on two mechanisms: a
ProcSignalBarrier is used to make sure all processes are starting the
resize procedure simultaneously, and a global Barrier is used to
coordinate after that and make sure all finished processes are waiting
for others that are in progress.

The resize process looks like this:

* The GUC assign hook sets a flag to let the Postmaster know that resize
was requested.

* Postmaster verifies the flag in the event loop, and starts the resize
by emitting a ProcSignal barrier.

* All processes, that participate in ProcSignal mechanism, begin to
process ProcSignal barrier. First a process waits until all processes
have confirmed they received the message and can start simultaneously.

As mentioned above, this basically makes the entire feature not really
online. Besides the latency of some processes not getting to the barrier
immediately, there's also the issue that actually reserving large amounts of
memory can take a long time - during which all processes would be unavailable.

I really don't see that being viable. It'd be one thing if that were a
"temporary" restriction, but the whole design seems to be fairly centered
around that.

Besides not really being online, isn't this a recipe for endless undetected
deadlocks? What if process A waits for a lock held by process B and process B
arrives at the barrier? Process A won't ever get there, because process B
can't make progress, because A is not making progress.

Greetings,

Andres Freund

#119Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#118)
Re: Changing shared_buffers without restart

Sorry for late reply folks.

On Thu, Sep 18, 2025 at 09:52:03AM -0400, Andres Freund wrote:

From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinating work
between backends to change, meaning we cannot apply it right away.

Add a new flag "pending" for an assign hook to allow the hook to indicate
exactly that. If the pending flag is set after the hook, the new value
will not be applied and its handling becomes the hook's implementation
responsibility.

I doubt it makes sense to add this to the GUC system. I think it'd be better
to just use the GUC value as the desired "target" configuration and have a
function or a show-only GUC for reporting the current size.

I don't think you can just block application of the GUC until the resize is
complete. E.g. what if the value was too big and the new configuration needs
to be fixed to be lower?

I think it was a bit hasty to post another version of the patch without
the design changes we've agreed upon last time. I'm still working on
that (sorry, it takes time, I haven't written so much Perl for testing
since forever), the current implementation doesn't include anything with
GUC to simplify the discussion. I'm still convinced that multi-step GUC
changing makes sense, but it has proven to be more complicated than I
anticipated, so I'll spin up another thread to discuss when I come to
it.

From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier allows to make sure the message sent
via EmitProcSignalBarrier was processed by all ProcSignal mechanism
participants.

Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
which will be updated when a process has received the message, but not
processed it yet. This makes it possible to support a new mode of
waiting, when ProcSignal participants want to synchronize message
processing. To do that, a participant can wait via
WaitForProcSignalBarrierReceived when processing a message, effectively
making sure that all processes are going to start processing
ProcSignalBarrier simultaneously.

I doubt "online resizing" that requires synchronously processing the same
event, can really be called "online". There can be significant delays in
processing a barrier, stalling the entire server until that is reached seems
like a complete no-go for production systems?

[...]

As mentioned above, this basically makes the entire feature not really
online. Besides the latency of some processes not getting to the barrier
immediately, there's also the issue that actually reserving large amounts of
memory can take a long time - during which all processes would be unavailable.

I really don't see that being viable. It'd be one thing if that were a
"temporary" restriction, but the whole design seems to be fairly centered
around that.

[...]

Besides not really being online, isn't this a recipe for endless undetected
deadlocks? What if process A waits for a lock held by process B and process B
arrives at the barrier? Process A won't ever get there, because process B
can't make progress, because A is not making progress.

Same as above, in the version I'm working on right now it's changed in
favor of an approach that looks more like the one from the "online checksum
change" patch. I've even stumbled upon cases where a process was just
killed and never arrived at the barrier, so that was it. The new approach
makes certain parts simpler, but requires managing backends with
different understanding of how large shared memory segments are for some
time interval. Introducing a new parameter "number of available buffers"
seems to be helpful to address all cases I've found so far.

Btw, under "online" resizing I mostly understood "without restart", the
goal was not to make it really "online".

-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40

Why does a patch like this contain changes like this mixed in with the rest?
That's clearly not directly related to $subject.

An artifact of rebasing, it belonged to 0007.

From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:47:04 +0200
Subject: [PATCH 06/16] Address space reservation for shared memory

Currently the shared memory layout is designed to pack everything tight
together, leaving no space between mappings for resizing. Here is how it
looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
anonymous shared memory we talk about:

00400000-00490000 /path/bin/postgres
...
012d9000-0133e000 [heap]
7f443a800000-7f470a800000 /dev/zero (deleted)
7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
...

Make the layout more dynamic via splitting every shared memory segment
into two parts:

* An anonymous file, which actually contains shared memory content. Such
an anonymous file is created via memfd_create, it lives in memory,
behaves like a regular file and semantically equivalent to an
anonymous memory allocated via mmap with MAP_ANONYMOUS.

* A reservation mapping, which size is much larger than required shared
segment size. This mapping is created with flags PROT_NONE (which
makes sure the reserved space is not used), and MAP_NORESERVE (to not
count the reserved space against memory limits). The anonymous file is
mapped into this reservation mapping.

The commit message fails to explain why, if we're already relying on
MAP_NORESERVE, we need to do anything else? Why can't we just have one maximally
sized allocation that's marked MAP_NORESERVE for all the parts that we don't
yet need?

How do we return memory to the OS in that case? Currently it's done
explicitly via truncating the anonymous file.

* The file could be given a name, which improves readability when it
comes to process maps.

* By default, Linux will not add file-backed shared mappings into a core dump,
making it more convenient to work with them in PostgreSQL: no more huge dumps
to process.

That's just as well a downside, because you now can't investigate some
issues. This was already configurable via coredump_filter.

This behaviour is configured via coredump_filter as well, so just the
default value has been changed.

From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:22:02 +0200
Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

Why do all these need to be separate segments? Afaict we'll have to maximally
size everything other than BUFFERS_SHMEM_SEGMENT at start?

Why would they need to be maxed out at the start? So far my rule of
thumb was one segment for one structure whose size depends on NBuffers,
so that when changing NBuffers each segment could be adjusted
independently.

+-- Test 2: Set to 64MB  
+ALTER SYSTEM SET shared_buffers = '64MB';
+SELECT pg_reload_conf();
+SELECT pg_sleep(1);
+SHOW shared_buffers;

Tests containing sleeps are a significant warning flag imo.

The tests I'm preparing so far avoid this by waiting on injection points.
I haven't found anything similar in existing tests, but I assume such an
approach is fine.

#120Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#119)
Re: Changing shared_buffers without restart

Hi,

On 2025-09-26 20:04:21 +0200, Dmitry Dolgov wrote:

On Thu, Sep 18, 2025 at 09:52:03AM -0400, Andres Freund wrote:

From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinating work
between backends to change, meaning we cannot apply it right away.

Add a new flag "pending" for an assign hook to allow the hook to indicate
exactly that. If the pending flag is set after the hook, the new value
will not be applied and its handling becomes the hook's implementation
responsibility.

I doubt it makes sense to add this to the GUC system. I think it'd be better
to just use the GUC value as the desired "target" configuration and have a
function or a show-only GUC for reporting the current size.

I don't think you can just block application of the GUC until the resize is
complete. E.g. what if the value was too big and the new configuration needs
to be fixed to be lower?

I think it was a bit hasty to post another version of the patch without
the design changes we've agreed upon last time. I'm still working on
that (sorry, it takes time, I haven't written so much Perl for testing
since forever), the current implementation doesn't include anything with
GUC to simplify the discussion. I'm still convinced that multi-step GUC
changing makes sense, but it has proven to be more complicated than I
anticipated, so I'll spin up another thread to discuss when I come to
it.

FWIW, I'm fairly convinced it's a completely dead end.

From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:47:04 +0200
Subject: [PATCH 06/16] Address space reservation for shared memory

Currently the shared memory layout is designed to pack everything tight
together, leaving no space between mappings for resizing. Here is how it
looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
anonymous shared memory we talk about:

00400000-00490000 /path/bin/postgres
...
012d9000-0133e000 [heap]
7f443a800000-7f470a800000 /dev/zero (deleted)
7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
...

Make the layout more dynamic via splitting every shared memory segment
into two parts:

* An anonymous file, which actually contains shared memory content. Such
an anonymous file is created via memfd_create, it lives in memory,
behaves like a regular file and semantically equivalent to an
anonymous memory allocated via mmap with MAP_ANONYMOUS.

* A reservation mapping, which size is much larger than required shared
segment size. This mapping is created with flags PROT_NONE (which
makes sure the reserved space is not used), and MAP_NORESERVE (to not
count the reserved space against memory limits). The anonymous file is
mapped into this reservation mapping.

The commit message fails to explain why, if we're already relying on
MAP_NORESERVE, we need to do anything else? Why can't we just have one maximally
sized allocation that's marked MAP_NORESERVE for all the parts that we don't
yet need?

How do we return memory to the OS in that case? Currently it's done
explicitly via truncating the anonymous file.

madvise with MADV_DONTNEED or MADV_REMOVE.
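
For reference, a minimal sketch of that alternative (hypothetical variable
names): one maximally sized mapping created up front, with memory handed
back via madvise. For a shared anonymous mapping it is MADV_REMOVE that
frees the backing pages, while MADV_DONTNEED mainly drops the calling
process's page table entries.

#include <sys/mman.h>

/* Sketch only: one maximally sized mapping, unused parts never touched. */
char *base = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

/* ... later, hand a no-longer-needed range back to the OS */
madvise(base + new_size, old_size - new_size, MADV_REMOVE);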

Greetings,

Andres Freund

#121Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#116)
Re: Changing shared_buffers without restart

On Thu, Sep 18, 2025 at 10:25:29AM +0530, Ashutosh Bapat wrote:
Given these things, I think we should set up the buffer lookup table
to hold maximum entries required to expand the buffer pool to its
maximum, right at the beginning.

Thanks for investigating. I think another option would be to rebuild the
buffer lookup table (create a new table based on the new size and copy
the data over from the original one) as part of the resize procedure,
alongside buffer eviction and initialization. From what I recall
the size of the buffer lookup table is about two orders of magnitude lower
than shared buffers, so the overhead should not be that large even for a
significant number of buffers.
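
As a rough illustration of that ratio: with 8kB pages, 16GB of shared buffers
is about 2 million buffers, and at roughly 40-50 bytes per buffer mapping
entry (BufferTag, buffer id and dynahash overhead) the lookup table ends up
on the order of 100MB, i.e. well under 1% of the buffer pool.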

#122Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#121)
Re: Changing shared_buffers without restart

On Sun, Sep 28, 2025 at 2:54 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Thu, Sep 18, 2025 at 10:25:29AM +0530, Ashutosh Bapat wrote:
Given these things, I think we should set up the buffer lookup table
to hold maximum entries required to expand the buffer pool to its
maximum, right at the beginning.

Thanks for investigating. I think another option would be to rebuild the
buffer lookup table (create a new table based on the new size and copy
the data over from the original one) as part of the resize procedure,
alongside buffer eviction and initialization. From what I recall
the size of the buffer lookup table is about two orders of magnitude lower
than shared buffers, so the overhead should not be that large even for a
significant number of buffers.

The proposal will work but will require significant work:

1. The pointer to the shared buffer lookup table will change. The
change needs to be absorbed by all the processes at the same time; we
cannot have some processes accessing the old lookup table and some
processes the new one. That has the potential to make many processes wait
for a very long time. That can be fixed by accessing a new pointer when
the next buffer lookup access happens, by modifying the BufTable*
functions. But that means an extra condition check and some extra
code in those hot paths. Not sure whether that's acceptable.
2. The memory consumed by the old buffer lookup table will need to be
"freed" to the OS. The only way to do so is by having a new memory
segment (which can be unmapped) or unmapping portions of the segment
dedicated to the buffer lookup table. That's some more synchronization
and additional wait times for backends.
3. While the new shared buffer lookup table is being built, processes
may be able to access it in shared mode but they may not be able to
make changes to it (or else we need to make corresponding changes to
the new table as well). That means more restrictions on the running
backends.

I am not saying that we can not implement your idea, but maybe we
could do that incrementally after basic resizing is in place.

--
Best Wishes,
Ashutosh Bapat

#123Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#119)
Re: Changing shared_buffers without restart

On Fri, Sep 26, 2025 at 11:34 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Sorry for late reply folks.

On Thu, Sep 18, 2025 at 09:52:03AM -0400, Andres Freund wrote:

From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinating work
between backends to change, meaning we cannot apply it right away.

Add a new flag "pending" for an assign hook to allow the hook to indicate
exactly that. If the pending flag is set after the hook, the new value
will not be applied and its handling becomes the hook's implementation
responsibility.

I doubt it makes sense to add this to the GUC system. I think it'd be better
to just use the GUC value as the desired "target" configuration and have a
function or a show-only GUC for reporting the current size.

I don't think you can just block application of the GUC until the resize is
complete. E.g. what if the value was too big and the new configuration needs
to be fixed to be lower?

I think it was a bit hasty to post another version of the patch without
the design changes we've agreed upon last time. I'm still working on
that (sorry, it takes time, I haven't written so much Perl for testing
since forever), the current implementation doesn't include anything with
GUC to simplify the discussion. I'm still convinced that multi-step GUC
changing makes sense, but it has proven to be more complicated than I
anticipated, so I'll spin up another thread to discuss when I come to
it.

From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier allows to make sure the message sent
via EmitProcSignalBarrier was processed by all ProcSignal mechanism
participants.

Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
which will be updated when a process has received the message, but not
processed it yet. This makes it possible to support a new mode of
waiting, when ProcSignal participants want to synchronize message
processing. To do that, a participant can wait via
WaitForProcSignalBarrierReceived when processing a message, effectively
making sure that all processes are going to start processing
ProcSignalBarrier simultaneously.

I doubt "online resizing" that requires synchronously processing the same
event, can really be called "online". There can be significant delays in
processing a barrier, stalling the entire server until that is reached seems
like a complete no-go for production systems?

[...]

As mentioned above, this basically makes the entire feature not really
online. Besides the latency of some processes not getting to the barrier
immediately, there's also the issue that actually reserving large amounts of
memory can take a long time - during which all processes would be unavailable.

I really don't see that being viable. It'd be one thing if that were a
"temporary" restriction, but the whole design seems to be fairly centered
around that.

[...]

Besides not really being online, isn't this a recipe for endless undetected
deadlocks? What if process A waits for a lock held by process B and process B
arrives at the barrier? Process A won't ever get there, because process B
can't make progress, because A is not making progress.

Same as above, in the version I'm working on right now it's changed in
favor of an approach that looks more like the one from the "online checksum
change" patch. I've even stumbled upon cases where a process was just
killed and never arrived at the barrier, so that was it. The new approach
makes certain parts simpler, but requires managing backends with
different understanding of how large shared memory segments are for some
time interval. Introducing a new parameter "number of available buffers"
seems to be helpful to address all cases I've found so far.

Btw, under "online" resizing I mostly understood "without restart", the
goal was not to make it really "online".

-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40

Why does a patch like this contain changes like this mixed in with the rest?
That's clearly not directly related to $subject.

An artifact of rebasing, it belonged to 0007.

From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:47:04 +0200
Subject: [PATCH 06/16] Address space reservation for shared memory

Currently the shared memory layout is designed to pack everything tight
together, leaving no space between mappings for resizing. Here is how it
looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
anonymous shared memory we talk about:

00400000-00490000 /path/bin/postgres
...
012d9000-0133e000 [heap]
7f443a800000-7f470a800000 /dev/zero (deleted)
7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
...

Make the layout more dynamic via splitting every shared memory segment
into two parts:

* An anonymous file, which actually contains shared memory content. Such
an anonymous file is created via memfd_create, it lives in memory,
behaves like a regular file and semantically equivalent to an
anonymous memory allocated via mmap with MAP_ANONYMOUS.

* A reservation mapping, which size is much larger than required shared
segment size. This mapping is created with flags PROT_NONE (which
makes sure the reserved space is not used), and MAP_NORESERVE (to not
count the reserved space against memory limits). The anonymous file is
mapped into this reservation mapping.

The commit message fails to explain why, if we're already relying on
MAP_NORESERVE, we need to do anything else? Why can't we just have one maximally
sized allocation that's marked MAP_NORESERVE for all the parts that we don't
yet need?

How do we return memory to the OS in that case? Currently it's done
explicitly via truncating the anonymous file.

* The file could be given a name, which improves readability when it
comes to process maps.

* By default, Linux will not add file-backed shared mappings into a core dump,
making it more convenient to work with them in PostgreSQL: no more huge dumps
to process.

That's just as well a downside, because you now can't investigate some
issues. This was already configurable via coredump_filter.

This behaviour is configured via coredump_filter as well, so just the
default value has been changed.

From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:22:02 +0200
Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

Why do all these need to be separate segments? Afaict we'll have to maximally
size everything other than BUFFERS_SHMEM_SEGMENT at start?

Why would they need to be maxed out at the start? So far my rule of
thumb was one segment for one structure whose size depends on NBuffers,
so that when changing NBuffers each segment could be adjusted
independently.

Offlist, Andres expressed the concern that having multiple shared memory
segments may impact the time it takes to disconnect a backend. If the
application is using all of the configured number of backends, a slow
disconnection will lead to a slow connection. If we want to go the
route of multiple segments (as many as 5) it would make sense to
measure that impact first.

Maxing things out at start avoids using multiple segments. Those segments
consume much less memory than the buffer blocks even when
maxed out with a reasonable max_shared_buffers setting. We avoid
complicating the code for a small increase in shared memory.

--
Best Wishes,
Ashutosh Bapat

#124Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#122)
Re: Changing shared_buffers without restart

On Mon, Sep 29, 2025 at 12:21:08PM +0530, Ashutosh Bapat wrote:
On Sun, Sep 28, 2025 at 2:54 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Thu, Sep 18, 2025 at 10:25:29AM +0530, Ashutosh Bapat wrote:
Given these things, I think we should set up the buffer lookup table
to hold maximum entries required to expand the buffer pool to its
maximum, right at the beginning.

Thanks for investigating. I think another option would be to rebuild the
buffer lookup table (create a new table based on the new size and copy
the data over from the original one) as part of the resize procedure,
alongside buffer eviction and initialization. From what I recall
the size of the buffer lookup table is about two orders of magnitude lower
than shared buffers, so the overhead should not be that large even for a
significant number of buffers.

The proposal will work but will require significant work:

1. The pointer to the shared buffer lookup table will change.

Which pointers do you mean? AFAICT no operation on the buffer lookup table
returns a pointer (they work with buffer id or a hash) and keys are
compared by value as well.

we cannot have some processes accessing the old lookup table and some
processes the new one. That has the potential to make many processes wait
for a very long time.

As I've mentioned above, the size of the buffer lookup table is a few
orders of magnitude lower than shared buffers, so I doubt "a very long
time". But it can be measured.

2. The memory consumed by the old buffer lookup table will need to be
"freed" to the OS. The only way to do so is by having a new memory
segment

The shared buffer lookup table already lives in its own segment as
implemented in the current patch, so I don't see any problem here.

I see you folks are inclined to keep some small segments static and
allocate maximum allowed memory for it. It's an option, at the end of
the day we need to experiment and measure both approaches.

#125Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#124)
Re: Changing shared_buffers without restart

On Wed, Oct 1, 2025 at 2:40 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Sep 29, 2025 at 12:21:08PM +0530, Ashutosh Bapat wrote:
On Sun, Sep 28, 2025 at 2:54 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Thu, Sep 18, 2025 at 10:25:29AM +0530, Ashutosh Bapat wrote:
Given these things, I think we should set up the buffer lookup table
to hold maximum entries required to expand the buffer pool to its
maximum, right at the beginning.

Thanks for investigating. I think another option would be to rebuild the
buffer lookup table (create a new table based on the new size and copy
the data over from the original one) as part of the resize procedure,
alongside buffer eviction and initialization. From what I recall
the size of the buffer lookup table is about two orders of magnitude lower
than shared buffers, so the overhead should not be that large even for a
significant number of buffers.

The proposal will work but will require significant work:

1. The pointer to the shared buffer lookup table will change.

Which pointers do you mean? AFAICT no operation on the buffer lookup table
returns a pointer (they work with buffer id or a hash) and keys are
compared by value as well.

The buffer lookup table itself.
/* Pass location of hashtable header to hash_create */
infoP->hctl = (HASHHDR *) location;

we can not have few processes accessing old lookup table and few
processes new one. That has potential to make many processes wait for
a very long time.

As I've mentioned above, the size of the buffer lookup table is a few
orders of magnitude lower than shared buffers, so I doubt "a very long
time". But it can be measured.

2. The memory consumed by the old buffer lookup table will need to be
"freed" to the OS. The only way to do so is by having a new memory
segment

The shared buffer lookup table already lives in its own segment as
implemented in the current patch, so I don't see any problem here.

The table is not a single chunk of memory. It's a few chunks spread
across the shared memory segment. Freeing a lookup table is like
freeing those chunks. We have ways to free tail parts of shared memory
segments, but not chunks in-between.

--
Best Wishes,
Ashutosh Bapat

#126Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#125)
Re: Changing shared_buffers without restart

On Wed, Oct 01, 2025 at 03:50:17PM +0530, Ashutosh Bapat wrote:
The buffer lookup table itself.
/* Pass location of hashtable header to hash_create */
infoP->hctl = (HASHHDR *) location;

How does this affect any users of the lookup table, if they do not even
get to see those?

Shared buffer lookup table already lives in it's own segment as
implemented in the current patch, so I don't see any problem here.

The table is not a single chunk of memory. It's a few chunks spread
across the shared memory segment. Freeing a lookup table is like
freeing those chunks. We have ways to free tail parts of shared memory
segments, but not chunks in-between.

Right, and the idea was to rebuild it completely to fit into the new
size, not just chunk-by-chunk.

#127Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Andres Freund (#117)
18 attachment(s)
Re: Changing shared_buffers without restart

Hi,

I started studying the interaction of the checkpointer process with
buffer pool resizing. Soon I noticed that the checkpointer didn't load
the config as frequently as other backends. When it is executing a
checkpoint, it does not reload the config for the entire duration of
the checkpoint, for example. As the synchronization is implemented in
the set of patches so far, the checkpointer will not see the new value
of shared_buffers and will not acknowledge the proc signal barrier, and
thus not enter the synchronized buffer resizing. However, other
backends will notice that the checkpointer has received the proc
signal barrier and will enter the synchronization process. Once the
proc signal barrier is received by all the backends, the backends
which have entered the synchronization process will move forward with
resizing the buffer pool, leaving behind those which have received but not
acknowledged the proc signal barrier. At the end there will be two
sets of backends: one which has entered synchronization and sees the
buffer pool with the new size, and the other which hasn't entered
synchronization and does not see the buffer pool with the new size. This
leads to SIGBUS or SIGSEGV in the latter set of backends. I saw this
mostly with the checkpointer process but we also saw it with other
types of backends.

Every aspect of buffer resizing that I started looking at was blocked
by this behaviour. Since there were already other suggestions and
comments about the current UI as well as synchronization mechanism, I
started implementing a different UI and synchronization as described
below. The WIP implementation is available in the attached set of
patches.

Patches 0001 to 0016 are the same as the previous patchset. I haven't
touched them in case someone would like to see an incremental change.
However, it's getting unwieldy at this point, so I will squash
relevant patches together and provide a patchset with fewer patches
next.
0017 reverts 0003 and gets rid of the "pending" GUC flag, which is
not required by the new UI. They will vanish from the next patchset.
0018 implements the new UI described below.

New UI and synchronization
======================

0018 changes the way "shared_buffers" is handled.
a. A new global variable NBuffersPending is used to hold the value of
this GUC. When the server starts, shared memory required by the buffer
manager is calculated using NBuffersPending instead of NBuffers. Once
the shared memory is allocated, NBuffers is set to NBuffersPending.
NBuffers, thus shows the number of buffers in the buffer pool instead
of the value of the GUC.
b. "shared_buffers" is PGC_SIGHUP now so it can be changed using ALTER
SYSTEM ... SET shared_buffers = ...; followed by SELECT
pg_reload_conf(). But this does not resize the buffer pool. It
merely sets NBuffersPending to the new value. A new function
pg_resize_buffer_pool() (described later) can be used to resize the
buffer pool to the pending value.
c. show "shared_buffers" shows the value of NBuffers, and
NBuffersPending if it differs from NBuffers. I think we need some
adjustment here when the resizing is in progress since the value of
NBuffers would be changed to the size of the active buffer pool
(explained later in the email), but I haven't worked out those details
yet.

A new GUC max_shared_buffers sets the upper limit on "shared_buffers".
It is PGC_POSTMASTER; requires a restart to change the value. This GUC
is used a. to reserve the address space for future expansion of the
buffer pool and b. allocate memory for a maximally sized buffer lookup
table at the server start. We may decide to use the GUC to maximally
allocate data structures other than buffer blocks as suggested by
Andres. But these patches don't do that. The default for this GUC is
0, which means it will be the same as shared_buffers. This maintains
backward compatibility and also allows systems that do not want to
resize the shared buffer pool to allocate minimal memory. When it is set
to a value other than 0, it should be set to a value higher than
shared_buffers at startup.

We need to support the ALTER SYSTEM ... SET shared_buffers = "" for
backward compatibility. The users will still be able to perform ALTER
SYSTEM and restart the server with a newer size of buffer pool. Also
this allows the new buffer pool size to be written to
postgresql.auto.conf and persist it. With this we can simply use
pg_reload_conf() to load the new value along with other GUC changes.
pg_resize_buffer_pool() merely picks the new value from the backend
where it is executed and resizes the buffer pool. It does not need the
new value to be loaded in all the backends.

We may want to use a new PGC_ for this GUC but PGC_SIGHUP suffices for
the time being and it might be acceptable with clear documentation.

pg_resize_buffer_pool() implements a phase-wise buffer pool resizing
operation, but it does not block all the backends until the buffer pool
resizing is finished. It works as follows (pasting from the prologue
in patch 0018):

When resizing, the buffer pool is divided into two portions:

- active buffer pool, which is the part of the buffer pool which
remains active even during resizing. Its size is given by
activeNBuffers. Newly allocated buffers will have their buffer ids
less than activeNBuffers.

- in-transit buffer pool, which is the part of the buffer pool which
may be accessible to some backends but not others depending upon the
time when a given backend processes a shrink/expand barrier. When
shrinking the buffer pool this is the part of the buffer pool which
will be evicted. When expanding the buffer pool this is the expanded
portion. Its size is given by transitNBuffers. The backends may see
buffer ids up to transitNBuffers until the resizing finishes.

Before starting resizing, activeNBuffers = transitNBuffers = NBuffers
where NBuffers is the size of buffer pool before resizing. NewNBuffers
is the new size of the shared buffer pool. After resizing finishes
activeNBuffers = transitNBuffers = NBuffers = newNBuffers.

In order to synchronize with other running backends, the coordinator
sends following ProcSignalBarriers in the order given below:

1. When shrinking the shared buffer pool the coordinator sends
SHBUF_SHRINK ProcSignalBarrier. Every backend sets activeNBuffers =
NewNBuffers to restrict its buffer pool allocations to the new size of
the buffer pool and acknowledges the ProcSignalBarrier. Once every
backend has acknowledged, the coordinator evicts the buffers in the
area being shrunk. Note that transitNBuffers is still NBuffers, so the
backends may see buffer ids up to NBuffers from earlier allocations
until eviction completes.

2. In both cases, when expanding the buffer pool or shrinking the
buffer pool, the coordinator sends SHBUF_RESIZE_MAP_AND_MEM
ProcSignalBarrier after resizing the shared memory segments and
initializing the required data structures if any. Every backend is
expected to adjust their shared memory segment address maps (by
calling AnonymousShmemResize()) and validate that their pointers to
the shared buffers structure are valid and have the right size. When
shrinking the shared buffer pool, transitNBuffers is set to NewNBuffers and
the backends should no longer see buffer ids beyond NewNBuffers; the
buffer resizing operation is finished at this stage. When expanding,
they should set transitNBuffers to NewNBuffers to accommodate the
backends which may accept the next barrier earlier than the others.
Once every backend acknowledges this barrier, the coordinator sends
the next barrier when expanding the buffer pool.

3. When expanding the buffer pool, the coordinator sends SHBUF_EXPAND
ProcSignalBarrier. The backends are expected to set activeNBuffers =
NewNBuffers and start allocating buffers from the expanded range. The
coordinator uses this barrier to know when all the backends have
settled using the new size of the buffer pool.

For either operation, at most two barriers are sent.
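
A rough coordinator-side sketch of the protocol above (the barrier type
names and the two helper functions are hypothetical placeholders, not the
patch's actual code):

static void
coordinate_buffer_pool_resize(int NewNBuffers)
{
    if (NewNBuffers < NBuffers)
    {
        /* 1. stop allocations beyond NewNBuffers in every backend */
        WaitForProcSignalBarrier(
            EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHBUF_SHRINK));
        /* evict everything in the range being shrunk (hypothetical helper) */
        EvictBuffersInRange(NewNBuffers, NBuffers);
    }

    /* resize shared memory segments, initialize new structures (hypothetical) */
    ResizeBufferPoolMemory(NewNBuffers);

    /* 2. every backend remaps its segments and updates transitNBuffers */
    WaitForProcSignalBarrier(
        EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHBUF_RESIZE_MAP_AND_MEM));

    if (NewNBuffers > NBuffers)
    {
        /* 3. only now start allocating from the expanded range */
        WaitForProcSignalBarrier(
            EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHBUF_EXPAND));
    }

    NBuffers = NewNBuffers;
}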

All this together in action looks like this (see the tests in the patch for
more examples):
SHOW shared_buffers; -- default
shared_buffers
----------------
128MB
(1 row)

ALTER SYSTEM SET shared_buffers = '64MB';
SELECT pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)

SHOW shared_buffers;
shared_buffers
-----------------------
128MB (pending: 64MB)
(1 row)

SELECT pg_resize_shared_buffers();
pg_resize_shared_buffers
--------------------------
t
(1 row)

SHOW shared_buffers;
shared_buffers
----------------
64MB
(1 row)

ALTER SYSTEM SET shared_buffers = '256MB';
SELECT pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)

SHOW shared_buffers;
shared_buffers
-----------------------
64MB (pending: 256MB)
(1 row)

SELECT pg_resize_shared_buffers();
pg_resize_shared_buffers
--------------------------
t
(1 row)

SHOW shared_buffers;
shared_buffers
----------------
256MB
(1 row)

On Thu, Sep 18, 2025 at 7:22 PM Andres Freund <andres@anarazel.de> wrote:

From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinating work
between backends to change, meaning we cannot apply it right away.

Add a new flag "pending" for an assign hook to allow the hook to indicate
exactly that. If the pending flag is set after the hook, the new value
will not be applied and its handling becomes the hook's implementation
responsibility.

I doubt it makes sense to add this to the GUC system. I think it'd be better
to just use the GUC value as the desired "target" configuration and have a
function or a show-only GUC for reporting the current size.

This has been taken care of in the new implementation with slightly
different approach to show command as described above.

I don't think you can just block application of the GUC until the resize is
complete. E.g. what if the value was too big and the new configuration needs
to be fixed to be lower?

With the above approach, the application of the GUC won't be blocked
but if the size being applied is taking too long, the operation will
be required to be cancelled before the new resize can happen. That's a
part that needs some work. Chasing a moving target requires a very
complex implementation, which would be good to avoid in the first
version at least. However, we should leave room for that future
enhancement. The current implementation gives that flexibility, I
think.

From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier allows to make sure the message sent
via EmitProcSignalBarrier was processed by all ProcSignal mechanism
participants.

Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
which will be updated when a process has received the message, but not
processed it yet. This makes it possible to support a new mode of
waiting, when ProcSignal participants want to synchronize message
processing. To do that, a participant can wait via
WaitForProcSignalBarrierReceived when processing a message, effectively
making sure that all processes are going to start processing
ProcSignalBarrier simultaneously.

I doubt "online resizing" that requires synchronously processing the same
event, can really be called "online". There can be significant delays in
processing a barrier, stalling the entire server until that is reached seems
like a complete no-go for production systems?

From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 14:16:55 +0200
Subject: [PATCH 11/16] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using space,
introduced in the previous commits without requiring PostgreSQL restart.
Essentially the implementation is based on two mechanisms: a
ProcSignalBarrier is used to make sure all processes are starting the
resize procedure simultaneously, and a global Barrier is used to
coordinate after that and make sure all finished processes are waiting
for others that are in progress.

The resize process looks like this:

* The GUC assign hook sets a flag to let the postmaster know that a resize
was requested.

* The postmaster checks the flag in its event loop and starts the resize
by emitting a ProcSignal barrier.

* All processes that participate in the ProcSignal mechanism begin to
process the ProcSignal barrier. First, each process waits until all
processes have confirmed they received the message and can start
simultaneously.

As mentioned above, this basically makes the entire feature not really
online. Besides the latency of some processes not getting to the barrier
immediately, there's also the issue that actually reserving large amounts of
memory can take a long time - during which all processes would be unavailable.

I really don't see that being viable. It'd be one thing if that were a
"temporary" restriction, but the whole design seems to be fairly centered
around that.

In the new implementation regular backends are not stalled while the
resizing is going on. They continue their work, with possible temporary
performance degradation (this needs to be measured).

From experiments it turns out that shared mappings have to be extended
separately in each process that uses them. Another rough edge is that a
backend blocked on ReadCommand will not apply the shared_buffers change
until it receives something.

That's not a rough edge, that basically makes the feature unusable, no?

The new synchronization doesn't have this problem since it doesn't require
every backend to load the new value. It is enough for the value to be
loaded only in the backend where pg_resize_buffer_pool() is run.

From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:22:02 +0200
Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into the following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

Why do all these need to be separate segments? Afaict we'll have to maximally
size everything other than BUFFERS_SHMEM_SEGMENT at start?

I am leaning towards that. I will implement that soon.

On Wed, Oct 1, 2025 at 2:40 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I see you folks are inclined to keep some small segments static and
allocate the maximum allowed memory for them. It's an option; at the end of
the day we need to experiment and measure both approaches.

I did measure performance with a maximally sized buffer lookup table
(shared_buffers = 128MB, max_shared_buffers = 10GB) on my laptop.
There was no noticeable difference in the performance. I will post
formal numbers with the next patchset.

* Every process recalculates the shared memory size based on the new
NBuffers, adjusts its size using ftruncate and adjusts the reservation
permissions with mprotect. One elected process signals the postmaster
to do the same.
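
In rough terms the per-process work could look like the sketch below,
assuming the segment is backed by an anonymous file and was mapped
PROT_NONE up to its maximum size at startup (segment_fd, base, old_size
and new_size are placeholder names):

    /* grow the backing file to the new size */
    if (ftruncate(segment_fd, new_size) < 0)
        elog(ERROR, "could not resize anonymous file: %m");

    /* make the newly available part of the reservation accessible */
    if (mprotect(base + old_size, new_size - old_size,
                 PROT_READ | PROT_WRITE) < 0)
        elog(ERROR, "could not adjust reservation permissions: %m");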

If we just used a single memory mapping with all unused parts marked
MAP_NORESERVE, we wouldn't need this (and wouldn't need a fair bit of other
work in this patchset).

On Sat, Sep 27, 2025 at 12:06 AM Andres Freund <andres@anarazel.de> wrote:

How do we return memory to the OS in that case? Currently it's done
explicitly via truncating the anonymous file.

madvise with MADV_DONTNEED or MADV_REMOVE.
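
For shrinking, that suggestion would amount to something like the following
sketch (base, old_size and new_size are placeholders; MADV_DONTNEED would be
used the same way):

    /* punch a hole in the no-longer-needed tail of the shared mapping */
    if (madvise(base + new_size, old_size - new_size, MADV_REMOVE) < 0)
        elog(LOG, "madvise(MADV_REMOVE) failed: %m");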

The patchset still uses ftruncate + mprotect. Apart from portability
concerns, I have questions about your proposal. The MADV_DONTNEED
documentation says:
    After a successful MADV_DONTNEED operation, the semantics of memory
    access in the specified region are changed: subsequent accesses of
    pages in the range will succeed, but will result in either
    repopulating the memory contents from the up-to-date contents of the
    underlying mapped file (for shared file mappings, shared anonymous
    mappings, and shmem-based techniques such as System V shared memory
    segments) or zero-fill-on-demand pages for anonymous private mappings.

    Note that, when applied to shared mappings, MADV_DONTNEED might not
    lead to immediate freeing of the pages in the range. The kernel is
    free to delay freeing the pages until an appropriate moment. The
    resident set size (RSS) of the calling process will be immediately
    reduced however.

    MADV_DONTNEED cannot be applied to locked pages, Huge TLB pages, or
    VM_PFNMAP pages. (Pages marked with the kernel-internal VM_PFNMAP
    flag are special memory areas that are not managed by the virtual
    memory subsystem. Such pages are typically created by device drivers
    that map the pages into user space.)

and MADV_REMOVE (since Linux 2.6.16):

    Free up a given range of pages and its associated backing store.
    This is equivalent to punching a hole in the corresponding byte
    range of the backing store (see fallocate(2)). Subsequent accesses
    in the specified address range will see bytes containing zero.

    The specified address range must be mapped shared and writable.
    This flag cannot be applied to locked pages, Huge TLB pages, or
    VM_PFNMAP pages.

Combining these two:
1. Access to the freed memory doesn't raise any error but returns
zero-filled pages. Won't that lead to silent corruption?
2. Neither is supported with Huge TLB pages, so they cannot be used
when huge_pages = on?

With the current approach, we get SIGBUS or SIGSEGV when a process
tries to access the freed memory. That protection won't be there with
madvise().
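
That protection comes from revoking access to the shrunk range, roughly
(placeholder names again):

    /* make the released range inaccessible after shrinking */
    if (mprotect(base + new_size, old_size - new_size, PROT_NONE) < 0)
        elog(LOG, "mprotect(PROT_NONE) failed: %m");

    /*
     * Any stray access to that range now faults, instead of silently
     * reading back zero-filled pages as it would after madvise().
     */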

The synchronization mechanism in this patch is inspired by Thomas's
implementation posted in [1].

I still need to go through Tomas's detailed comments and address those
which still apply. And the patches are still WIP, with many TODOs. But
I wanted to get some feedback on the proposed UI and synchronization
as described above.

I will be looking into the cases below one by one:
1. New backends joining while the synchronization is going on, and an
existing backend exiting.
2. A failure or crash in the backend which is executing pg_resize_buffer_pool().
3. Fixing crashes in the tests.

[1]: postgr.es/m/CA+hUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat

Attachments:

0003-Introduce-pending-flag-for-GUC-assign-hooks-20251013.patch (application/x-patch)
From f616d9a4e88c9edabae143d6f402c3f54730cd0d Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sun, 6 Apr 2025 16:40:32 +0200
Subject: [PATCH 03/19] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need coordinating work
between backends to change, meaning we cannot apply it right away.

Add a new flag "pending" for an assign hook to allow the hook to indicate
exactly that. If the pending flag is set after the hook, the new value
will not be applied and its handling becomes the responsibility of the
hook's implementation.

Note that this also requires changes in how GUCs are reported, but the
patch does not cover that yet.
---
 src/backend/access/transam/xlog.c    |  2 +-
 src/backend/commands/variable.c      |  6 +--
 src/backend/libpq/pqcomm.c           |  8 ++--
 src/backend/tcop/postgres.c          |  2 +-
 src/backend/utils/misc/guc.c         | 59 +++++++++++++++++++---------
 src/backend/utils/misc/stack_depth.c |  2 +-
 src/include/utils/guc.h              |  2 +-
 src/include/utils/guc_hooks.h        | 20 +++++-----
 8 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..cc48b253bc8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2197,7 +2197,7 @@ CalculateCheckpointSegments(void)
 }
 
 void
-assign_max_wal_size(int newval, void *extra)
+assign_max_wal_size(int newval, void *extra, bool *pending)
 {
 	max_wal_size_mb = newval;
 	CalculateCheckpointSegments();
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 608f10d9412..e40dae2ddf2 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1143,7 +1143,7 @@ check_cluster_name(char **newval, void **extra, GucSource source)
  * GUC assign_hook for maintenance_io_concurrency
  */
 void
-assign_maintenance_io_concurrency(int newval, void *extra)
+assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
 {
 	/*
 	 * Reconfigure recovery prefetching, because a setting it depends on
@@ -1161,12 +1161,12 @@ assign_maintenance_io_concurrency(int newval, void *extra)
  * they may be assigned in either order.
  */
 void
-assign_io_max_combine_limit(int newval, void *extra)
+assign_io_max_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit = Min(newval, io_combine_limit_guc);
 }
 void
-assign_io_combine_limit(int newval, void *extra)
+assign_io_combine_limit(int newval, void *extra, bool *pending)
 {
 	io_combine_limit = Min(io_max_combine_limit, newval);
 }
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 25f739a6a17..1726a7c0993 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1951,7 +1951,7 @@ pq_settcpusertimeout(int timeout, Port *port)
  * GUC assign_hook for tcp_keepalives_idle
  */
 void
-assign_tcp_keepalives_idle(int newval, void *extra)
+assign_tcp_keepalives_idle(int newval, void *extra, bool *pending)
 {
 	/*
 	 * The kernel API provides no way to test a value without setting it; and
@@ -1984,7 +1984,7 @@ show_tcp_keepalives_idle(void)
  * GUC assign_hook for tcp_keepalives_interval
  */
 void
-assign_tcp_keepalives_interval(int newval, void *extra)
+assign_tcp_keepalives_interval(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivesinterval(newval, MyProcPort);
@@ -2007,7 +2007,7 @@ show_tcp_keepalives_interval(void)
  * GUC assign_hook for tcp_keepalives_count
  */
 void
-assign_tcp_keepalives_count(int newval, void *extra)
+assign_tcp_keepalives_count(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivescount(newval, MyProcPort);
@@ -2030,7 +2030,7 @@ show_tcp_keepalives_count(void)
  * GUC assign_hook for tcp_user_timeout
  */
 void
-assign_tcp_user_timeout(int newval, void *extra)
+assign_tcp_user_timeout(int newval, void *extra, bool *pending)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_settcpusertimeout(newval, MyProcPort);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7dd75a490aa..193efeb9022 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3597,7 +3597,7 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 
 /* GUC assign hook for transaction_timeout */
 void
-assign_transaction_timeout(int newval, void *extra)
+assign_transaction_timeout(int newval, void *extra, bool *pending)
 {
 	if (IsTransactionState())
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8794e26ef1d..c9361a0e423 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1681,6 +1681,7 @@ InitializeOneGUCOption(struct config_generic *gconf)
 				struct config_int *conf = (struct config_int *) gconf;
 				int			newval = conf->boot_val;
 				void	   *extra = NULL;
+				bool 	   pending = false;
 
 				Assert(newval >= conf->min);
 				Assert(newval <= conf->max);
@@ -1689,9 +1690,13 @@ InitializeOneGUCOption(struct config_generic *gconf)
 					elog(FATAL, "failed to initialize %s to %d",
 						 conf->gen.name, newval);
 				if (conf->assign_hook)
-					conf->assign_hook(newval, extra);
-				*conf->variable = conf->reset_val = newval;
-				conf->gen.extra = conf->reset_extra = extra;
+					conf->assign_hook(newval, extra, &pending);
+
+				if (!pending)
+				{
+					*conf->variable = conf->reset_val = newval;
+					conf->gen.extra = conf->reset_extra = extra;
+				}
 				break;
 			}
 		case PGC_REAL:
@@ -2047,13 +2052,18 @@ ResetAllOptions(void)
 			case PGC_INT:
 				{
 					struct config_int *conf = (struct config_int *) gconf;
+					bool 			  pending = false;
 
 					if (conf->assign_hook)
 						conf->assign_hook(conf->reset_val,
-										  conf->reset_extra);
-					*conf->variable = conf->reset_val;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									conf->reset_extra);
+										  conf->reset_extra,
+										  &pending);
+					if (!pending)
+					{
+						*conf->variable = conf->reset_val;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										conf->reset_extra);
+					}
 					break;
 				}
 			case PGC_REAL:
@@ -2430,16 +2440,21 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
 							struct config_int *conf = (struct config_int *) gconf;
 							int			newval = newvalue.val.intval;
 							void	   *newextra = newvalue.extra;
+							bool 	    pending = false;
 
 							if (*conf->variable != newval ||
 								conf->gen.extra != newextra)
 							{
 								if (conf->assign_hook)
-									conf->assign_hook(newval, newextra);
-								*conf->variable = newval;
-								set_extra_field(&conf->gen, &conf->gen.extra,
-												newextra);
-								changed = true;
+									conf->assign_hook(newval, newextra, &pending);
+
+								if (!pending)
+								{
+									*conf->variable = newval;
+									set_extra_field(&conf->gen, &conf->gen.extra,
+													newextra);
+									changed = true;
+								}
 							}
 							break;
 						}
@@ -3856,18 +3871,24 @@ set_config_with_handle(const char *name, config_handle *handle,
 
 				if (changeVal)
 				{
+					bool pending = false;
+
 					/* Save old value to support transaction abort */
 					if (!makeDefault)
 						push_old_value(&conf->gen, action);
 
 					if (conf->assign_hook)
-						conf->assign_hook(newval, newextra);
-					*conf->variable = newval;
-					set_extra_field(&conf->gen, &conf->gen.extra,
-									newextra);
-					set_guc_source(&conf->gen, source);
-					conf->gen.scontext = context;
-					conf->gen.srole = srole;
+						conf->assign_hook(newval, newextra, &pending);
+
+					if (!pending)
+					{
+						*conf->variable = newval;
+						set_extra_field(&conf->gen, &conf->gen.extra,
+										newextra);
+						set_guc_source(&conf->gen, source);
+						conf->gen.scontext = context;
+						conf->gen.srole = srole;
+					}
 				}
 				if (makeDefault)
 				{
diff --git a/src/backend/utils/misc/stack_depth.c b/src/backend/utils/misc/stack_depth.c
index 8f7cf531fbc..ef59ae62008 100644
--- a/src/backend/utils/misc/stack_depth.c
+++ b/src/backend/utils/misc/stack_depth.c
@@ -156,7 +156,7 @@ check_max_stack_depth(int *newval, void **extra, GucSource source)
 
 /* GUC assign hook for max_stack_depth */
 void
-assign_max_stack_depth(int newval, void *extra)
+assign_max_stack_depth(int newval, void *extra, bool *pending)
 {
 	ssize_t		newval_bytes = newval * (ssize_t) 1024;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f21ec37da89..c3056cd2da8 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -187,7 +187,7 @@ typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource sourc
 typedef bool (*GucEnumCheckHook) (int *newval, void **extra, GucSource source);
 
 typedef void (*GucBoolAssignHook) (bool newval, void *extra);
-typedef void (*GucIntAssignHook) (int newval, void *extra);
+typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);
 typedef void (*GucRealAssignHook) (double newval, void *extra);
 typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 typedef void (*GucEnumAssignHook) (int newval, void *extra);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 82ac8646a8d..658c799419e 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -81,12 +81,12 @@ extern bool check_log_stats(bool *newval, void **extra, GucSource source);
 extern bool check_log_timezone(char **newval, void **extra, GucSource source);
 extern void assign_log_timezone(const char *newval, void *extra);
 extern const char *show_log_timezone(void);
-extern void assign_maintenance_io_concurrency(int newval, void *extra);
-extern void assign_io_max_combine_limit(int newval, void *extra);
-extern void assign_io_combine_limit(int newval, void *extra);
-extern void assign_max_wal_size(int newval, void *extra);
+extern void assign_maintenance_io_concurrency(int newval, void *extra, bool *pending);
+extern void assign_io_max_combine_limit(int newval, void *extra, bool *pending);
+extern void assign_io_combine_limit(int newval, void *extra, bool *pending);
+extern void assign_max_wal_size(int newval, void *extra, bool *pending);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
-extern void assign_max_stack_depth(int newval, void *extra);
+extern void assign_max_stack_depth(int newval, void *extra, bool *pending);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
@@ -141,13 +141,13 @@ extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 extern void assign_syslog_facility(int newval, void *extra);
 extern void assign_syslog_ident(const char *newval, void *extra);
-extern void assign_tcp_keepalives_count(int newval, void *extra);
+extern void assign_tcp_keepalives_count(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_count(void);
-extern void assign_tcp_keepalives_idle(int newval, void *extra);
+extern void assign_tcp_keepalives_idle(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_idle(void);
-extern void assign_tcp_keepalives_interval(int newval, void *extra);
+extern void assign_tcp_keepalives_interval(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_interval(void);
-extern void assign_tcp_user_timeout(int newval, void *extra);
+extern void assign_tcp_user_timeout(int newval, void *extra, bool *pending);
 extern const char *show_tcp_user_timeout(void);
 extern bool check_temp_buffers(int *newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra,
@@ -163,7 +163,7 @@ extern bool check_transaction_buffers(int *newval, void **extra, GucSource sourc
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
 extern bool check_transaction_read_only(bool *newval, void **extra, GucSource source);
-extern void assign_transaction_timeout(int newval, void *extra);
+extern void assign_transaction_timeout(int newval, void *extra, bool *pending);
 extern const char *show_unix_socket_permissions(void);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern bool check_wal_consistency_checking(char **newval, void **extra,
-- 
2.34.1

0002-Process-config-reload-in-AIO-workers-20251013.patch (application/x-patch)
From 8fe9b13edfb2dc84047baa2fed9f48246b42af85 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 15:14:33 +0200
Subject: [PATCH 02/19] Process config reload in AIO workers

Currently AIO workers process interrupts only via CHECK_FOR_INTERRUPTS,
which does not include ConfigReloadPending. Thus we need to check for it
explicitly.
---
 src/backend/storage/aio/method_worker.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index b5ac073a910..d1c6da89c4b 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -80,6 +80,7 @@ static void pgaio_worker_shmem_init(bool first_time);
 static bool pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh);
 static int	pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
 
+static void pgaio_worker_process_interrupts(void);
 
 const IoMethodOps pgaio_worker_ops = {
 	.shmem_size = pgaio_worker_shmem_size,
@@ -463,6 +464,8 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		int			nwakeups = 0;
 		int			worker;
 
+		pgaio_worker_process_interrupts();
+
 		/*
 		 * Try to get a job to do.
 		 *
@@ -592,3 +595,25 @@ pgaio_workers_enabled(void)
 {
 	return io_method == IOMETHOD_WORKER;
 }
+
+/*
+ * Process any new interrupts.
+ */
+static void
+pgaio_worker_process_interrupts(void)
+{
+	/*
+	 * Reloading config can trigger further signals, complicating interrupts
+	 * processing -- so let it run first.
+	 *
+	 * XXX: Is there any need in memory barrier after ProcessConfigFile?
+	 */
+	if (ConfigReloadPending)
+	{
+		ConfigReloadPending = false;
+		ProcessConfigFile(PGC_SIGHUP);
+	}
+
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+}
-- 
2.34.1

0001-Add-system-view-for-shared-buffer-lookup-ta-20251013.patch (application/x-patch)
From 1a13e00fd8b069d653f08132a3d35c7c17fdf5c9 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 25 Aug 2025 19:23:50 +0530
Subject: [PATCH 01/19] Add system view for shared buffer lookup table

The view exposes the contents of the shared buffer lookup table for
debugging, testing and investigation.

TODO:

It would be better to place this view in pg_buffercache, but it's added as a
system view since BufHashTable is not exposed outside buf_table.c. To
move it to pg_buffercache, we should move the function
pg_get_buffer_lookup_table() to pg_buffercache and have it invoke
BufTableGetContent(), passing it the tuple store and tuple descriptor.
BufTableGetContent() fills the tuple store; the partitions are locked by
pg_get_buffer_lookup_table().

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
---
 doc/src/sgml/system-views.sgml         | 89 ++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql   |  7 ++
 src/backend/storage/buffer/buf_table.c | 61 ++++++++++++++++++
 src/include/catalog/pg_proc.dat        | 11 ++++
 src/test/regress/expected/rules.out    |  7 ++
 5 files changed, 175 insertions(+)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 7971498fe75..8f3e2741051 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -71,6 +71,11 @@
       <entry>backend memory contexts</entry>
      </row>
 
+     <row>
+      <entry><link linkend="view-pg-buffer-lookup-table"><structname>pg_buffer_lookup_table</structname></link></entry>
+      <entry>shared buffer lookup table</entry>
+     </row>
+
      <row>
       <entry><link linkend="view-pg-config"><structname>pg_config</structname></link></entry>
       <entry>compile-time configuration parameters</entry>
@@ -901,6 +906,90 @@ AND c1.path[c2.level] = c2.path[c2.level];
   </para>
  </sect1>
 
+ <sect1 id="view-pg-buffer-lookup-table">
+  <title><structname>pg_buffer_lookup_table</structname></title>
+  <indexterm>
+   <primary>pg_buffer_lookup_table</primary>
+  </indexterm>
+  <para>
+   The <structname>pg_buffer_lookup_table</structname> view exposes the current
+   contents of the shared buffer lookup table. Each row represents an entry in
+   the lookup table mapping a relation page to the ID of the buffer in which it is
+   cached. The shared buffer lookup table is locked for a short duration while
+   reading so as to ensure consistency. This may affect performance if this view
+   is queried very frequently.
+  </para>
+  <table id="pg-buffer-lookup-table-view" xreflabel="pg_buffer_lookup_table">
+   <title><structname>pg_buffer_lookup_table</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>tablespace</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the tablespace containing the relation
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>database</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the database containing the relation (zero for shared relations)
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>relfilenode</structfield> <type>oid</type>
+      </para>
+      <para>
+       relfilenode identifying the relation
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>forknum</structfield> <type>int2</type>
+      </para>
+      <para>
+       Fork number within the relation (see <xref linkend="storage-file-layout"/>)
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>blocknum</structfield> <type>int8</type>
+      </para>
+      <para>
+       Block number within the relation
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>bufferid</structfield> <type>int4</type>
+      </para>
+      <para>
+       ID of the buffer caching the page 
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+  <para>
+   Access to this view is restricted to members of the
+   <literal>pg_read_all_stats</literal> role by default.
+  </para>
+ </sect1>
+
  <sect1 id="view-pg-config">
   <title><structname>pg_config</structname></title>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 823776c1498..c7240250c07 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1436,3 +1436,10 @@ REVOKE ALL ON pg_aios FROM PUBLIC;
 GRANT SELECT ON pg_aios TO pg_read_all_stats;
 REVOKE EXECUTE ON FUNCTION pg_get_aios() FROM PUBLIC;
 GRANT EXECUTE ON FUNCTION pg_get_aios() TO pg_read_all_stats;
+
+CREATE VIEW pg_buffer_lookup_table AS
+    SELECT * FROM pg_get_buffer_lookup_table();
+REVOKE ALL ON pg_buffer_lookup_table FROM PUBLIC;
+GRANT SELECT ON pg_buffer_lookup_table TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_buffer_lookup_table() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_buffer_lookup_table() TO pg_read_all_stats;
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 9d256559bab..1f6e215a2ca 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -21,7 +21,12 @@
  */
 #include "postgres.h"
 
+#include "fmgr.h"
+#include "funcapi.h"
 #include "storage/buf_internals.h"
+#include "storage/lwlock.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
 
 /* entry for buffer lookup hashtable */
 typedef struct
@@ -159,3 +164,59 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 	if (!result)				/* shouldn't happen */
 		elog(ERROR, "shared buffer hash table corrupted");
 }
+
+/*
+ * SQL callable function to report contents of the shared buffer lookup table.
+ */
+Datum
+pg_get_buffer_lookup_table(PG_FUNCTION_ARGS)
+{
+#define PG_GET_BUFFER_LOOKUP_TABLE_COLS 6
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	HASH_SEQ_STATUS hstat;
+	BufferLookupEnt *ent;
+	Datum		values[PG_GET_BUFFER_LOOKUP_TABLE_COLS];
+	bool		nulls[PG_GET_BUFFER_LOOKUP_TABLE_COLS];
+	int			i;
+
+	memset(nulls, 0, sizeof(nulls));
+
+	/*
+	 * We put all the tuples into a tuplestore in one scan of the hashtable.
+	 * This avoids any issue of the hashtable possibly changing between calls.
+	 */
+	InitMaterializedSRF(fcinfo, 0);
+
+	Assert(rsinfo->setDesc->natts == PG_GET_BUFFER_LOOKUP_TABLE_COLS);
+
+	/*
+	 * Lock all buffer mapping partitions to ensure a consistent view of the
+	 * hash table during the scan. Must grab LWLocks in partition-number order
+	 * to avoid LWLock deadlock.
+	 */
+	for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
+		LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);
+
+	hash_seq_init(&hstat, SharedBufHash);
+	while ((ent = (BufferLookupEnt *) hash_seq_search(&hstat)) != NULL)
+	{
+		values[0] = ObjectIdGetDatum(ent->key.spcOid);
+		values[1] = ObjectIdGetDatum(ent->key.dbOid);
+		values[2] = ObjectIdGetDatum(ent->key.relNumber);
+		values[3] = ObjectIdGetDatum(ent->key.forkNum);
+		values[4] = UInt32GetDatum(ent->key.blockNum);
+		values[5] = Int32GetDatum(ent->id);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+
+	/*
+	 * Release all buffer mapping partition locks in the reverse order so as
+	 * to avoid LWLock deadlock.
+	 */
+	for (i = NUM_BUFFER_PARTITIONS - 1; i >= 0; i--)
+		LWLockRelease(BufMappingPartitionLockByIndex(i));
+
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b51d2b17379..e631323a325 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8600,6 +8600,17 @@
   proargmodes => '{o,o,o}', proargnames => '{name,type,size}',
   prosrc => 'pg_get_dsm_registry_allocations' },
 
+# buffer lookup table
+{ oid => '5102',
+  descr => 'shared buffer lookup table',
+  proname => 'pg_get_buffer_lookup_table', prorows => '6', proretset => 't',
+  provolatile => 'v', prorettype => 'record',
+  proargtypes => '', proallargtypes => '{oid,oid,oid,int2,int8,int4}',
+  proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{tablespace,database,relfilenode,forknum,blocknum,bufferid}',
+  prosrc => 'pg_get_buffer_lookup_table'
+},
+
 # memory context of local backend
 { oid => '2282',
   descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 16753b2e4c0..83f566d3218 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1330,6 +1330,13 @@ pg_backend_memory_contexts| SELECT name,
     free_chunks,
     used_bytes
    FROM pg_get_backend_memory_contexts() pg_get_backend_memory_contexts(name, ident, type, level, path, total_bytes, total_nblocks, free_bytes, free_chunks, used_bytes);
+pg_buffer_lookup_table| SELECT tablespace,
+    database,
+    relfilenode,
+    forknum,
+    blocknum,
+    bufferid
+   FROM pg_get_buffer_lookup_table() pg_get_buffer_lookup_table(tablespace, database, relfilenode, forknum, blocknum, bufferid);
 pg_config| SELECT name,
     setting
    FROM pg_config() pg_config(name, setting);

base-commit: 7a662a46ebf74e9fa15cb62b592b4bf00c96fc94
-- 
2.34.1

0004-Introduce-pss_barrierReceivedGeneration-20251013.patch (application/x-patch)
From a7cdc1871e0626b0b3f60ea68044ee77eca192c3 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 4 Apr 2025 21:46:14 +0200
Subject: [PATCH 04/19] Introduce pss_barrierReceivedGeneration

Currently WaitForProcSignalBarrier makes it possible to ensure that the
message sent via EmitProcSignalBarrier was processed by all ProcSignal
mechanism participants.

Add pss_barrierReceivedGeneration alongside pss_barrierGeneration,
which will be updated when a process has received the message but not
yet processed it. This makes it possible to support a new mode of
waiting, where ProcSignal participants want to synchronize message
processing. To do that, a participant can wait via
WaitForProcSignalBarrierReceived when processing a message, effectively
making sure that all processes are going to start processing the
ProcSignalBarrier simultaneously.
---
 src/backend/storage/ipc/procsignal.c | 67 ++++++++++++++++++++++------
 src/include/storage/procsignal.h     |  1 +
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 087821311cc..eb3ceaae809 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -58,7 +58,10 @@
  * of it. For such use cases, we set a bit in pss_barrierCheckMask and then
  * increment the current "barrier generation"; when the new barrier generation
  * (or greater) appears in the pss_barrierGeneration flag of every process,
- * we know that the message has been received everywhere.
+ * we know that the message has been received and processed everywhere. In case
+ * we only need to know that the message was received everywhere (e.g.
+ * receiving processes need to handle the message in a coordinated fashion),
+ * use pss_barrierReceivedGeneration in the same way.
  */
 typedef struct
 {
@@ -70,6 +73,7 @@ typedef struct
 
 	/* Barrier-related fields (not protected by pss_mutex) */
 	pg_atomic_uint64 pss_barrierGeneration;
+	pg_atomic_uint64 pss_barrierReceivedGeneration;
 	pg_atomic_uint32 pss_barrierCheckMask;
 	ConditionVariable pss_barrierCV;
 } ProcSignalSlot;
@@ -152,6 +156,8 @@ ProcSignalShmemInit(void)
 			slot->pss_cancel_key_len = 0;
 			MemSet(slot->pss_signalFlags, 0, sizeof(slot->pss_signalFlags));
 			pg_atomic_init_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+			pg_atomic_init_u64(&slot->pss_barrierReceivedGeneration,
+							   PG_UINT64_MAX);
 			pg_atomic_init_u32(&slot->pss_barrierCheckMask, 0);
 			ConditionVariableInit(&slot->pss_barrierCV);
 		}
@@ -199,6 +205,8 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	barrier_generation =
 		pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration,
+						barrier_generation);
 
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
@@ -263,6 +271,7 @@ CleanupProcSignalState(int status, Datum arg)
 	 * no barrier waits block on it.
 	 */
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+	pg_atomic_write_u64(&slot->pss_barrierReceivedGeneration, PG_UINT64_MAX);
 
 	SpinLockRelease(&slot->pss_mutex);
 
@@ -416,12 +425,8 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
 	return generation;
 }
 
-/*
- * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
- * requested by a specific call to EmitProcSignalBarrier() have taken effect.
- */
-void
-WaitForProcSignalBarrier(uint64 generation)
+static void
+WaitForProcSignalBarrierInternal(uint64 generation, bool receivedOnly)
 {
 	Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));
 
@@ -436,12 +441,17 @@ WaitForProcSignalBarrier(uint64 generation)
 		uint64		oldval;
 
 		/*
-		 * It's important that we check only pss_barrierGeneration here and
-		 * not pss_barrierCheckMask. Bits in pss_barrierCheckMask get cleared
-		 * before the barrier is actually absorbed, but pss_barrierGeneration
+		 * It's important that we check only pss_barrierGeneration &
+		 * pss_barrierReceivedGeneration here and not pss_barrierCheckMask. Bits in
+		 * pss_barrierCheckMask get cleared before the barrier is actually
+		 * absorbed, but pss_barrierGeneration & pss_barrierReceivedGeneration
 		 * is updated only afterward.
 		 */
-		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+		if (receivedOnly)
+			oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+		else
+			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
 		while (oldval < generation)
 		{
 			if (ConditionVariableTimedSleep(&slot->pss_barrierCV,
@@ -450,7 +460,11 @@ WaitForProcSignalBarrier(uint64 generation)
 				ereport(LOG,
 						(errmsg("still waiting for backend with PID %d to accept ProcSignalBarrier",
 								(int) pg_atomic_read_u32(&slot->pss_pid))));
-			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+
+			if (receivedOnly)
+				oldval = pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration);
+			else
+				oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		}
 		ConditionVariableCancelSleep();
 	}
@@ -464,12 +478,33 @@ WaitForProcSignalBarrier(uint64 generation)
 	 * The caller is probably calling this function because it wants to read
 	 * the shared state or perform further writes to shared state once all
 	 * backends are known to have absorbed the barrier. However, the read of
-	 * pss_barrierGeneration was performed unlocked; insert a memory barrier
-	 * to separate it from whatever follows.
+	 * pss_barrierGeneration & pss_barrierReceivedGeneration was performed
+	 * unlocked; insert a memory barrier to separate it from whatever follows.
 	 */
 	pg_memory_barrier();
 }
 
+/*
+ * WaitForProcSignalBarrier - wait until it is guaranteed that all changes
+ * requested by a specific call to EmitProcSignalBarrier() have taken effect.
+ */
+void
+WaitForProcSignalBarrier(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, false);
+}
+
+/*
+ * WaitForProcSignalBarrierReceived - wait until it is guaranteed that all
+ * backends have observed the message sent by a specific call to
+ * EmitProcSignalBarrier().
+ */
+void
+WaitForProcSignalBarrierReceived(uint64 generation)
+{
+	WaitForProcSignalBarrierInternal(generation, true);
+}
+
 /*
  * Handle receipt of an interrupt indicating a global barrier event.
  *
@@ -523,6 +558,10 @@ ProcessProcSignalBarrier(void)
 	if (local_gen == shared_gen)
 		return;
 
+	/* The message is observed, record that */
+	pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
+						shared_gen);
+
 	/*
 	 * Get and clear the flags that are set for this backend. Note that
 	 * pg_atomic_exchange_u32 is a full barrier, so we're guaranteed that the
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index afeeb1ca019..2733bbb8c5b 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -79,6 +79,7 @@ extern void SendCancelRequest(int backendPID, const uint8 *cancel_key, int cance
 
 extern uint64 EmitProcSignalBarrier(ProcSignalBarrierType type);
 extern void WaitForProcSignalBarrier(uint64 generation);
+extern void WaitForProcSignalBarrierReceived(uint64 generation);
 extern void ProcessProcSignalBarrier(void);
 
 extern void procsignal_sigusr1_handler(SIGNAL_ARGS);
-- 
2.34.1

0005-Allow-to-use-multiple-shared-memory-mapping-20251013.patch (application/x-patch)
From bcb7a085a92f6b6c3bbd4c75819d5b4c4462ab03 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 28 Feb 2025 19:54:47 +0100
Subject: [PATCH 05/19] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways in which the shared memory could be
organized.

Introduce the possibility to allocate multiple shared memory mappings, where
a single mapping is associated with a specified shared memory segment.
There is only a fixed number of available segments; currently only one
main shared memory segment is allocated. A new shared memory API is
introduced, extended with a segment as a new parameter. As the path of
least resistance, the original API is kept in place, utilizing the main
shared memory segment.
---
 src/backend/port/posix_sema.c     |   4 +-
 src/backend/port/sysv_sema.c      |   4 +-
 src/backend/port/sysv_shmem.c     | 138 +++++++++++++++++++---------
 src/backend/port/win32_sema.c     |   2 +-
 src/backend/storage/ipc/ipc.c     |   4 +-
 src/backend/storage/ipc/ipci.c    |  63 +++++++------
 src/backend/storage/ipc/shmem.c   | 148 +++++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c |  15 ++-
 src/include/storage/ipc.h         |   2 +-
 src/include/storage/pg_sema.h     |   2 +-
 src/include/storage/pg_shmem.h    |  18 ++++
 src/include/storage/shmem.h       |  11 +++
 12 files changed, 283 insertions(+), 128 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..401e1113fa1 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 #endif
 
 	numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 6ac83ea1a82..7bb363989c4 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -327,7 +327,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	struct stat statbuf;
 
@@ -348,7 +348,7 @@ PGReserveSemaphores(int maxSemas)
 	 * ShmemAlloc() won't be ready yet.
 	 */
 	sharedSemas = (PGSemaphore)
-		ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+		ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 	numSharedSemas = 0;
 	maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..56af0231d24 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+	int shmem_segment;
+	Size shmem_size; 			/* Size of the mapping */
+	Pointer shmem; 				/* Pointer to the start of the mapped memory */
+	Pointer seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping segments */
+static int next_free_segment = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_segment)
+{
+	switch (shmem_segment)
+	{
+		case MAIN_SHMEM_SEGMENT:
+			return "main";
+		default:
+			return "unknown";
+	}
+}
+
+static void
+DebugMappings()
+{
+	for(int i = 0; i < next_free_segment; i++)
+	{
+		AnonymousMapping m = Mappings[i];
+		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+			 MappingName(i), m.shmem, m.shmem_size);
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-	Size		allocsize = *size;
+	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
 				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		{
+			DebugMappings();
+			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
+				 MappingName(mapping->shmem_segment), allocsize);
+		}
 	}
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
 		 * Use the original size, not the rounded-up value, when falling back
 		 * to non-huge pages.
 		 */
-		allocsize = *size;
+		allocsize = mapping->shmem_size;
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
 				   PG_MMAP_FLAGS, -1, 0);
 		mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
 	if (ptr == MAP_FAILED)
 	{
 		errno = mmap_errno;
+		DebugMappings();
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment)),
 				 (mmap_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
-	*size = allocsize;
-	return ptr;
+	mapping->shmem = ptr;
+	mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		AnonymousMapping m = Mappings[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	mapping->shmem_size = size;
+	mapping->shmem_segment = next_free_segment;
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping);
+
+		next_free_segment++;
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + next_free_segment;
 
 	for (;;)
 	{
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	mapping->seg_addr = memAddress;
+	mapping->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (mapping->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < next_free_segment; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		AnonymousMapping m = Mappings[i];
+
+		if (m.seg_addr != NULL)
+		{
+			if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+			m.seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (m.shmem != NULL)
+		{
+			if (munmap(m.shmem, m.shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 m.shmem, m.shmem_size);
+			m.shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index 5854ad1f54d..e7365ff8060 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 2704e80b3a7..1965b2d3eb4 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -61,6 +61,8 @@ static void proc_exit_prepare(int code);
  * but provide some additional features we need --- in particular,
  * we want to register callbacks to invoke when we are disconnecting
  * from a broken shared-memory context but not exiting the postmaster.
+ * The maximum number of such exit callbacks depends on the number of shared
+ * memory segments.
  *
  * Callback functions can take zero, one, or two args: the first passed
  * arg is the integer exitcode, the second is the Datum supplied when
@@ -68,7 +70,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..8b38e985327 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -86,7 +86,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_segment)
 {
 	Size		size;
 	int			numSemas;
@@ -206,33 +206,38 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize(&numSemas);
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
-
-	/*
-	 * Create semaphores
-	 */
-	PGReserveSemaphores(numSemas);
-
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		/* Compute the size of the shared-memory block */
+		size = CalculateShmemSize(&numSemas, segment);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+		/*
+		 * Create the shmem segment.
+		 *
+		 * XXX: Are multiple shims needed, one per segment?
+		 */
+		seghdr = PGSharedMemoryCreate(size, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSegment(seghdr, segment);
+
+		/*
+		 * Create semaphores
+		 */
+		PGReserveSemaphores(numSemas, segment);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSegment(segment);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -363,7 +368,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas);
+	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index a0770e86796..f185ed28f95 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -76,19 +76,19 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
+								 int shmem_segment);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
+ShmemSegment Segments[ANON_MAPPINGS];
 
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/*
+ * Primary index hashtable for shmem; for simplicity we use a single one for all
+ * shared memory segments. There can be performance consequences of that, and
+ * an alternative option would be to have one index per shared memory segment.
+ */
+static HTAB *ShmemIndex = NULL;
 
 /* To get reliable results for NUMA inquiry we need to "touch pages" once */
 static bool firstNumaTouch = true;
@@ -101,9 +101,17 @@ Datum		pg_numa_available(PG_FUNCTION_ARGS);
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-	ShmemSegHdr = seghdr;
-	ShmemBase = seghdr;
-	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+	InitShmemAccessInSegment(seghdr, MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAccessInSegment(PGShmemHeader *seghdr, int shmem_segment)
+{
+	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	ShmemSegment *seg = &Segments[shmem_segment];
+	seg->ShmemSegHdr = shmhdr;
+	seg->ShmemBase = (void *) shmhdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -114,7 +122,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSegment(MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAllocationInSegment(int shmem_segment)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_segment].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -123,9 +137,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_segment].ShmemLock = (slock_t *) ShmemAllocUnlockedInSegment(sizeof(slock_t), shmem_segment);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_segment].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -150,11 +164,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocInSegment(Size size, int shmem_segment)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -184,6 +204,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSegment(size, allocated_size, MAIN_SHMEM_SEGMENT);
+}
+
+static void *
+ShmemAllocRawInSegment(Size size, Size *allocated_size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -203,22 +229,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= Segments[shmem_segment].ShmemSegHdr->totalsize)
 	{
-		newSpace = (char *) ShmemBase + newStart;
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
+		Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -236,6 +262,12 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+	return ShmemAllocUnlockedInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -246,19 +278,19 @@ ShmemAllocUnlocked(Size size)
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of shared memory (%zu bytes requested)",
 						size)));
-	ShmemSegHdr->freeoffset = newFree;
+	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (char *) ShmemBase + newStart;
+	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -273,7 +305,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSegment(addr, MAIN_SHMEM_SEGMENT);
+}
+
+bool
+ShmemAddrIsValidInSegment(const void *addr, int shmem_segment)
+{
+	return (addr >= Segments[shmem_segment].ShmemBase) && (addr < Segments[shmem_segment].ShmemEnd);
 }
 
 /*
@@ -334,6 +372,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  int64 max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSegment(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SEGMENT);
+}
+
+HTAB *
+ShmemInitHashInSegment(const char *name,		/* table string name for shmem index */
+			  long init_size,		/* initial table size */
+			  long max_size,		/* max size of the table */
+			  HASHCTL *infoP,		/* info about key and bucket size */
+			  int hash_flags,		/* info about infoP */
+			  int shmem_segment) 	/* in which segment to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -350,9 +400,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSegment(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_segment);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -385,6 +435,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSegment(name, size, foundPtr, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
+					  int shmem_segment)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -393,7 +450,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_segment].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -416,7 +473,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSegment(size, shmem_segment);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -433,8 +490,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in segment %d",
+						name, shmem_segment)));
 	}
 
 	if (*foundPtr)
@@ -459,7 +516,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -478,14 +535,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -542,10 +598,11 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
+	/* XXX: take all shared memory segments into account. */
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
 		values[2] = Int64GetDatum(ent->size);
 		values[3] = Int64GetDatum(ent->allocated_size);
 		named_allocated += ent->allocated_size;
@@ -557,15 +614,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	/* output shared memory allocated but not counted via the shmem index */
 	values[0] = CStringGetTextDatum("<anonymous>");
 	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 	/* output as-of-yet unused shared memory */
 	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
 	values[3] = values[2];
 	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
@@ -630,7 +687,12 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 	 * this is not very likely, and moreover we have more entries, each of
 	 * them using only fraction of the total pages.
 	 */
-	shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
+	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
+		PGShmemHeader *shmhdr = Segments[segment].ShmemSegHdr;
+		shm_total_page_count += (shmhdr->totalsize / os_page_size) + 1;
+	}
+
 	page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
 	pages_status = palloc(sizeof(int) * shm_total_page_count);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index b017880f5e4..c25dd13b63a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -80,6 +80,8 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
+#include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/procnumber.h"
@@ -612,12 +614,15 @@ LWLockNewTrancheId(const char *name)
 	/*
 	 * We use the ShmemLock spinlock to protect LWLockCounter and
 	 * LWLockTrancheNames.
+	 * 
+	 * XXX: Looks like this is the only use of Segments outside of shmem.c,
+	 * it's maybe worth it to reshape this part to hide Segments structure.
 	 */
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	if (*LWLockCounter - LWTRANCHE_FIRST_USER_DEFINED >= MAX_NAMED_TRANCHES)
 	{
-		SpinLockRelease(ShmemLock);
+		SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 		ereport(ERROR,
 				(errmsg("maximum number of tranches already registered"),
 				 errdetail("No more than %d tranches may be registered.",
@@ -628,7 +633,7 @@ LWLockNewTrancheId(const char *name)
 	LocalLWLockCounter = *LWLockCounter;
 	strlcpy(LWLockTrancheNames[result - LWTRANCHE_FIRST_USER_DEFINED], name, NAMEDATALEN);
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	return result;
 }
@@ -750,9 +755,9 @@ GetLWTrancheName(uint16 trancheId)
 	 */
 	if (trancheId >= LocalLWLockCounter)
 	{
-		SpinLockAcquire(ShmemLock);
+		SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 		LocalLWLockCounter = *LWLockCounter;
-		SpinLockRelease(ShmemLock);
+		SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 		if (trancheId >= LocalLWLockCounter)
 			elog(ERROR, "tranche %d is not registered", trancheId);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 3baf418b3d1..6ebda479ced 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index fa6ca35a51f..8ae9637fcd0 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_segment);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 5f7d4b83a60..2348c59b5a0 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+} ShmemSegment;
+
+/* Number of available segments for anonymous memory mappings */
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -91,4 +106,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main segment, contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SEGMENT 0
+
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index cd683a9d2d9..910c43f54f4 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -30,15 +30,26 @@ extern PGDLLIMPORT slock_t *ShmemLock;
 typedef struct PGShmemHeader PGShmemHeader; /* avoid including
 											 * storage/pg_shmem.h here */
 extern void InitShmemAccess(PGShmemHeader *seghdr);
+extern void InitShmemAccessInSegment(struct PGShmemHeader *seghdr,
+									 int shmem_segment);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSegment(int shmem_segment);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSegment(Size size, int shmem_segment);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSegment(Size size, int shmem_segment);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSegment(const void *addr, int shmem_segment);
 extern void InitShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, int64 init_size, int64 max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
+									long max_size, HASHCTL *infoP,
+									int hash_flags, int shmem_segment);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSegment(const char *name, Size size,
+									  bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 
-- 
2.34.1
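
An editorial aside, not part of the patch: a sketch of how a caller would use
the segment-aware variants added above. With only MAIN_SHMEM_SEGMENT defined
at this point in the series the call is equivalent to plain ShmemInitStruct;
later patches pass buffer-specific segments instead:

    bool          found;
    CkptSortItem *ids;

    /* hypothetical caller: place an NBuffers-sized array in a chosen segment */
    ids = (CkptSortItem *)
        ShmemInitStructInSegment("Checkpoint BufferIds",
                                 NBuffers * sizeof(CkptSortItem),
                                 &found, MAIN_SHMEM_SEGMENT);
    if (!found)
        memset(ids, 0, NBuffers * sizeof(CkptSortItem));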

0008-Fix-compilation-failures-from-previous-comm-20251013.patch (application/x-patch)
From 385af90e5bc853f56ed30dd0031c8010f7a45d71 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Wed, 20 Aug 2025 11:35:20 +0530
Subject: [PATCH 08/19] Fix compilation failures from previous commits

shm_total_page_count is used uninitialized. If this variable starts with a
random value, the final sum would be wrong.

Also include pg_shmem.h where shared memory segment macros are used.

Author: Ashutosh Bapat
---
 src/backend/storage/buffer/buf_init.c | 1 +
 src/backend/storage/ipc/shmem.c       | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 5383442e213..6d703e18f8b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -16,6 +16,7 @@
 
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
+#include "storage/pg_shmem.h"
 #include "storage/bufmgr.h"
 
 BufferDescPadded *BufferDescriptors;
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 9bb73f31052..e6cb919f0fc 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -649,7 +649,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 	Size		os_page_size;
 	void	  **page_ptrs;
 	int		   *pages_status;
-	uint64		shm_total_page_count,
+	uint64		shm_total_page_count = 0,
 				shm_ent_page_count,
 				max_nodes;
 	Size	   *nodes;
-- 
2.34.1

0006-Address-space-reservation-for-shared-memory-20251013.patch (application/x-patch)
From 62a3e35f4e42bf9c586901e1f9c8f75869b0a13e Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:47:04 +0200
Subject: [PATCH 06/19] Address space reservation for shared memory

Currently the shared memory layout is designed to pack everything tightly
together, leaving no space between mappings for resizing. Here is how it
looks for one mapping in /proc/$PID/maps; /dev/zero represents the
anonymous shared memory in question:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
    ...

Make the layout more dynamic via splitting every shared memory segment
into two parts:

* An anonymous file, which actually contains the shared memory content. Such
  a file is created via memfd_create; it lives in memory, behaves like a
  regular file, and is semantically equivalent to anonymous memory allocated
  via mmap with MAP_ANONYMOUS.

* A reservation mapping, whose size is much larger than the required shared
  segment size. This mapping is created with the flags PROT_NONE (which
  makes sure the reserved space is not used) and MAP_NORESERVE (so that the
  reserved space is not counted against memory limits). The anonymous file
  is mapped into this reservation mapping.

The resulting layout looks like this:

    00400000-00490000         /path/bin/postgres
    ...
    3f526000-3f590000 rw-p 		[heap]
    7fbd827fe000-7fbd8bdde000 rw-s 	/memfd:main (deleted) -- anon file
    7fbd8bdde000-7fbe82800000 ---s 	/memfd:main (deleted) -- reservation
    7fbe82800000-7fbe90670000 r--p 	/usr/lib/locale/locale-archive
    7fbe90800000-7fbe90941000 r-xp 	/usr/lib64/libstdc++.so.6.0.34

To resize a shared memory segment in this layout it's possible to use ftruncate
on the anonymous file, adjusting access permissions on the reserved space as
needed.
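
As a minimal illustration of this reserve-then-commit scheme, here is a
standalone sketch, independent of PostgreSQL and of these patches (Linux-only,
sizes made up, error handling abbreviated):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t reserved = 1UL << 30;      /* 1 GB of address space reserved */
        size_t used = 128UL << 20;        /* 128 MB actually backed by memory */
        int fd;
        char *base;

        /* anonymous in-memory file that backs the segment */
        fd = memfd_create("demo", 0);
        if (fd < 0 || ftruncate(fd, used) < 0)
            return 1;

        /* reserve a large, inaccessible, unaccounted address range */
        base = mmap(NULL, reserved, PROT_NONE,
                    MAP_SHARED | MAP_NORESERVE, fd, 0);
        if (base == MAP_FAILED)
            return 1;

        /* make the first 128 MB usable; the rest stays reserved */
        if (mprotect(base, used, PROT_READ | PROT_WRITE) < 0)
            return 1;
        memset(base, 0, used);

        /* grow later: extend the file, then widen the protection in place */
        used = 256UL << 20;
        if (ftruncate(fd, used) < 0 ||
            mprotect(base, used, PROT_READ | PROT_WRITE) < 0)
            return 1;

        printf("segment grown in place at %p\n", (void *) base);
        return 0;
    }

Shrinking would presumably go the other way: truncate the file down and flip
the tail of the mapping back to PROT_NONE.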

This approach also does not impact the actual memory usage as reported by
the kernel. Here is the output of /proc/$PID/status for the master
version with shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages
    // mapped in mm_struct. It corresponds to the mapped reserved space
    // and is the only number that grows with it.
    VmPeak:          2043192 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             22908 kB
    // Size of resident anonymous memory
    RssAnon:             768 kB
    // Size of resident file mappings
    RssFile:           10364 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          11776 kB

Here is the same for the patch when reserving 20GB of space:

    VmPeak:         21255824 kB
    VmRSS:             25020 kB
    RssAnon:             768 kB
    RssFile:           10812 kB
    RssShmem:          13440 kB

Cgroup v2 doesn't have any problems with that either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16.6 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    20770816 (~19.8 MB)

To control the amount of space reserved, a new GUC max_available_memory
is introduced. Ideally it should be based on the maximum available
memory, hence the name.
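
As a rough example, with the boot value proposed here (524288 blocks of 8 kB,
i.e. 4 GB of reserved address space), a setup that starts small but leaves
room to grow could look like this (values are purely illustrative):

    # postgresql.conf
    shared_buffers = 128MB            # initial size, resizable at runtime
    max_available_memory = 4GB        # upper bound for later resizing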

There are also a few unrelated advantages of using anon files:

* We've got a file descriptor, which could be used for regular file
  operations (modification, truncation, you name it).

* The file could be given a name, which improves readability when it
  comes to process maps.

* By default, Linux will not add file-backed shared mappings into a core dump,
  making it more convenient to work with them in PostgreSQL: no more huge dumps
  to process.

The downside is that memfd_create is Linux specific.
---
 src/backend/port/sysv_shmem.c             | 290 ++++++++++++++++++----
 src/backend/port/win32_shmem.c            |   2 +-
 src/backend/storage/ipc/ipci.c            |   5 +-
 src/backend/storage/ipc/shmem.c           |   2 +-
 src/backend/utils/init/globals.c          |   1 +
 src/backend/utils/misc/guc_parameters.dat |  12 +
 src/include/miscadmin.h                   |   1 +
 src/include/portability/mem.h             |   2 +-
 src/include/storage/pg_shmem.h            |   5 +-
 9 files changed, 260 insertions(+), 60 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 56af0231d24..363ddfd1fca 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -97,10 +97,12 @@ void	   *UsedShmemSegAddr = NULL;
 typedef struct AnonymousMapping
 {
 	int shmem_segment;
-	Size shmem_size; 			/* Size of the mapping */
+	Size shmem_size; 			/* Size of the actually used memory */
+	Size shmem_reserved; 		/* Size of the reserved mapping */
 	Pointer shmem; 				/* Pointer to the start of the mapped memory */
 	Pointer seg_addr; 			/* SysV shared memory for the header */
 	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -108,6 +110,49 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
+/*
+ * Anonymous mapping layout we use looks like this:
+ *
+ * 00400000-00c2a000 r-xp 			/bin/postgres
+ * ...
+ * 3f526000-3f590000 rw-p 			[heap]
+ * 7fbd827fe000-7fbd8bdde000 rw-s 	/memfd:main (deleted)
+ * 7fbd8bdde000-7fbe82800000 ---s 	/memfd:main (deleted)
+ * 7fbe82800000-7fbe90670000 r--p 	/usr/lib/locale/locale-archive
+ * 7fbe90800000-7fbe90941000 r-xp 	/usr/lib64/libstdc++.so.6.0.34
+ * ...
+ *
+ * We need to place shared memory mappings in such a way, that there will be
+ * gaps between them in the address space. Those gaps have to be large enough
+ * to resize the mapping up to certain size, without counting towards the total
+ * memory consumption.
+ *
+ * To achieve this, for each shared memory segment we first create an anonymous
+ * file of specified size using memfd_create, which will accommodate actual
+ * shared memory mapping content. It is represented by the first /memfd:main
+ * with rw permissions. Then we create a mapping for this file using mmap, with
+ * size much larger than required and flags PROT_NONE (allows to make sure the
+ * reserved space will not be used) and MAP_NORESERVE (prevents the space from
+ * being counted against memory limits). The mapping serves as an address space
+ * reservation, into which shared memory segment can be extended and is
+ * represented by the second /memfd:main with no permissions.
+ *
+ * The reserved space for each segment is calculated as a fraction of the total
+ * reserved space (MaxAvailableMemory), as specified in the SHMEM_RESIZE_RATIO
+ * array.
+ */
+static double SHMEM_RESIZE_RATIO[1] = {
+	1.0, 									/* MAIN_SHMEM_SLOT */
+};
+
+/*
+ * Flag telling that we have decided to use huge pages.
+ *
+ * XXX: It's possible to use GetConfigOption("huge_pages_status", false, false)
+ * instead, but it feels like an overkill.
+ */
+static bool huge_pages_on = false;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -503,19 +548,20 @@ PGSharedMemoryAttach(IpcMemoryId shmId,
  * hugepage sizes, we might want to think about more invasive strategies,
  * such as increasing shared_buffers to absorb the extra space.
  *
- * Returns the (real, assumed or config provided) page size into
- * *hugepagesize, and the hugepage-related mmap flags to use into
- * *mmap_flags if requested by the caller.  If huge pages are not supported,
- * *hugepagesize and *mmap_flags are set to 0.
+ * Returns the (real, assumed or config provided) page size into *hugepagesize,
+ * the hugepage-related mmap and memfd flags to use into *mmap_flags and
+ * *memfd_flags if requested by the caller. If huge pages are not supported,
+ * *hugepagesize, *mmap_flags and *memfd_flags are set to 0.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 #ifdef MAP_HUGETLB
 
 	Size		default_hugepagesize = 0;
 	Size		hugepagesize_local = 0;
 	int			mmap_flags_local = 0;
+	int			memfd_flags_local = 0;
 
 	/*
 	 * System-dependent code to find out the default huge page size.
@@ -574,6 +620,7 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	}
 
 	mmap_flags_local = MAP_HUGETLB;
+	memfd_flags_local = MFD_HUGETLB;
 
 	/*
 	 * On recent enough Linux, also include the explicit page size, if
@@ -584,7 +631,16 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	{
 		int			shift = pg_ceil_log2_64(hugepagesize_local);
 
-		mmap_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+	}
+#endif
+
+#if defined(MFD_HUGE_MASK) && defined(MFD_HUGE_SHIFT)
+	if (hugepagesize_local != default_hugepagesize)
+	{
+		int			shift = pg_ceil_log2_64(hugepagesize_local);
+
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
 	}
 #endif
 
@@ -593,6 +649,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*mmap_flags = mmap_flags_local;
 	if (hugepagesize)
 		*hugepagesize = hugepagesize_local;
+	if (memfd_flags)
+		*memfd_flags = memfd_flags_local;
 
 #else
 
@@ -600,6 +658,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*hugepagesize = 0;
 	if (mmap_flags)
 		*mmap_flags = 0;
+	if (memfd_flags)
+		*memfd_flags = 0;
 
 #endif							/* MAP_HUGETLB */
 }
@@ -625,72 +685,90 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
  * Creates an anonymous mmap()ed shared memory segment.
  *
  * This function will modify mapping size to the actual size of the allocation,
- * if it ends up allocating a segment that is larger than requested.
+ * if it ends up allocating a segment that is larger than requested. If needed,
+ * it also rounds up the mapping reserved size to be a multiple of huge page
+ * size.
+ *
+ * Note that we do not fall back from huge pages to regular pages in this
+ * function; this decision was already made in ReserveAnonymousMemory and we
+ * stick to it.
  */
 static void
 CreateAnonymousSegment(AnonymousMapping *mapping)
 {
 	Size		allocsize = mapping->shmem_size;
 	void	   *ptr = MAP_FAILED;
-	int			mmap_errno = 0;
+	int			save_errno = 0;
+	int			mmap_flags = PG_MMAP_FLAGS, memfd_flags = 0;
+
+	elog(DEBUG1, "segment[%s]: size %zu, reserved %zu",
+		 MappingName(mapping->shmem_segment), mapping->shmem_size,
+		 mapping->shmem_reserved);
 
 #ifndef MAP_HUGETLB
-	/* PGSharedMemoryCreate should have dealt with this case */
-	Assert(huge_pages != HUGE_PAGES_ON);
+	/* PrepareHugePages should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
 #else
-	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	if (huge_pages_on)
 	{
-		/*
-		 * Round up the request size to a suitable large value.
-		 */
 		Size		hugepagesize;
-		int			mmap_flags;
 
-		GetHugePageSize(&hugepagesize, &mmap_flags);
+		/* Make sure nothing is messed up */
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+		/* Round up the request size to a suitable large value */
+		GetHugePageSize(&hugepagesize, &mmap_flags, &memfd_flags);
 
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
-		mmap_errno = errno;
-		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-		{
-			DebugMappings();
-			elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 MappingName(mapping->shmem_segment), allocsize);
-		}
+		/*
+		 * The reserved space is multiple of BLCKSZ. We know the huge page
+		 * size, round up the reserved space to it.
+		 */
+		mapping->shmem_reserved = mapping->shmem_reserved + hugepagesize -
+			(mapping->shmem_reserved % hugepagesize);
+
+		/* Verify that the new size is within the reserved boundaries */
+		if (mapping->shmem_reserved < mapping->shmem_size)
+			ereport(ERROR,
+					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					 errmsg("not enough shared memory is reserved"),
+					 errhint("You may need to increase \"max_available_memory\".")));
+
+		mmap_flags = PG_MMAP_FLAGS | mmap_flags;
 	}
 #endif
 
 	/*
-	 * Report whether huge pages are in use.  This needs to be tracked before
-	 * the second mmap() call if attempting to use huge pages failed
-	 * previously.
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped,  it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, thus it has
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
 	 */
-	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
-					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_segment),
+									   memfd_flags);
 
-	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
+	/*
+	 * Specify the segment file size using allocsize, which contains
+	 * potentially modified value.
+	 */
+	if(ftruncate(mapping->segment_fd, allocsize) == -1)
 	{
-		/*
-		 * Use the original size, not the rounded-up value, when falling back
-		 * to non-huge pages.
-		 */
-		allocsize = mapping->shmem_size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
-		mmap_errno = errno;
-	}
+		save_errno = errno;
 
-	if (ptr == MAP_FAILED)
-	{
-		errno = mmap_errno;
 		DebugMappings();
+		close(mapping->segment_fd);
+
+		errno = save_errno;
 		ereport(FATAL,
-				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+				(errmsg("segment[%s]: could not truncate anonymous file: %m",
 						MappingName(mapping->shmem_segment)),
-				 (mmap_errno == ENOMEM) ?
+				 (save_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
 						 "swap space, or huge pages. To reduce the request size "
@@ -700,10 +778,112 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 						 allocsize) : 0));
 	}
 
+	elog(DEBUG1, "segment[%s]: mmap(%zu)",
+		 MappingName(mapping->shmem_segment), allocsize);
+
+	/*
+	 * Create a reservation mapping.
+	 */
+	ptr = mmap(NULL, mapping->shmem_reserved, PROT_NONE,
+			   mmap_flags | MAP_NORESERVE, mapping->segment_fd, 0);
+	save_errno = errno;
+
+	if (ptr == MAP_FAILED)
+	{
+		DebugMappings();
+
+		errno = save_errno;
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment))));
+	}
+
+	/* Make the memory accessible */
+	if(mprotect(ptr, allocsize, PROT_READ | PROT_WRITE) == -1)
+	{
+		save_errno = errno;
+		DebugMappings();
+
+		errno = save_errno;
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not mprotect anonymous shared memory: %m",
+						MappingName(mapping->shmem_segment))));
+	}
+
 	mapping->shmem = ptr;
 	mapping->shmem_size = allocsize;
 }
 
+/*
+ * PrepareHugePages
+ *
+ * Figure out if there are enough huge pages to allocate all shared memory
+ * segments, and report that information via huge_pages_status and
+ * huge_pages_on. It needs to be called before creating shared memory segments.
+ *
+ * It is necessary to maintain the same semantic (simple on/off) for
+ * huge_pages_status, even if there are multiple shared memory segments: all
+ * segments either use huge pages or not, there is no mix of segments with
+ * different page size. The latter might actually be beneficial, in particular
+ * because only some segments may require a large amount of memory, but for now
+ * we go with a simple solution.
+ */
+void
+PrepareHugePages()
+{
+	void	   *ptr = MAP_FAILED;
+
+	/* Reset to handle reinitialization */
+	next_free_segment = 0;
+
+	/* Complain if hugepages demanded but we can't possibly support them */
+#if !defined(MAP_HUGETLB)
+	if (huge_pages == HUGE_PAGES_ON)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("huge pages not supported on this platform")));
+#else
+	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	{
+		Size		hugepagesize, total_size = 0;
+		int			mmap_flags;
+
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+
+		/*
+		 * Figure out how much memory is needed for all segments, keeping in
+		 * mind that for every segment this value will be rounded up to the
+		 * huge page size. The resulting value will be used to probe memory and
+		 * decide whether we will allocate huge pages or not.
+		 */
+		for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+		{
+			int	numSemas;
+			Size segment_size = CalculateShmemSize(&numSemas, segment);
+
+			if (segment_size % hugepagesize != 0)
+				segment_size += hugepagesize - (segment_size % hugepagesize);
+
+			total_size += segment_size;
+		}
+
+		/* Map total amount of memory to test its availability. */
+		elog(DEBUG1, "reserving space: probe mmap(%zu) with MAP_HUGETLB",
+					 total_size);
+		ptr = mmap(NULL, total_size, PROT_NONE,
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
+	}
+#endif
+
+	/*
+	 * Report whether huge pages are in use. This needs to be tracked before
+	 * creating shared memory segments.
+	 */
+	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
+					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	huge_pages_on = ptr != MAP_FAILED;
+}
+
 /*
  * AnonymousShmemDetach --- detach from an anonymous mmap'd block
  * (called as an on_shmem_exit callback, hence funny argument list)
@@ -746,7 +926,7 @@ PGSharedMemoryCreate(Size size,
 	void	   *memAddress;
 	PGShmemHeader *hdr;
 	struct stat statbuf;
-	Size		sysvsize;
+	Size		sysvsize, total_reserved;
 	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
@@ -760,14 +940,6 @@ PGSharedMemoryCreate(Size size,
 				 errmsg("could not stat data directory \"%s\": %m",
 						DataDir)));
 
-	/* Complain if hugepages demanded but we can't possibly support them */
-#if !defined(MAP_HUGETLB)
-	if (huge_pages == HUGE_PAGES_ON)
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("huge pages not supported on this platform")));
-#endif
-
 	/* For now, we don't support huge pages in SysV memory */
 	if (huge_pages == HUGE_PAGES_ON && shared_memory_type != SHMEM_TYPE_MMAP)
 		ereport(ERROR,
@@ -776,8 +948,16 @@ PGSharedMemoryCreate(Size size,
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+
+	/* Prepare the mapping information */
 	mapping->shmem_size = size;
 	mapping->shmem_segment = next_free_segment;
+	total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
+	mapping->shmem_reserved = total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
+
+	/* Round up to be a multiple of BLCKSZ */
+	mapping->shmem_reserved = mapping->shmem_reserved + BLCKSZ -
+		(mapping->shmem_reserved % BLCKSZ);
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 4dee856d6bd..732fedee87e 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -627,7 +627,7 @@ pgwin32_ReserveSharedMemoryRegion(HANDLE hChild)
  * use GetLargePageMinimum() instead.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 	if (hugepagesize)
 		*hugepagesize = 0;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8b38e985327..b60f7ef9ce2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -206,6 +206,9 @@ CreateSharedMemoryAndSemaphores(void)
 
 	Assert(!IsUnderPostmaster);
 
+	/* Decide if we use huge pages or regular size pages */
+	PrepareHugePages();
+
 	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
 	{
 		/* Compute the size of the shared-memory block */
@@ -377,7 +380,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the number of huge pages required.
 	 */
-	GetHugePageSize(&hp_size, NULL);
+	GetHugePageSize(&hp_size, NULL, NULL);
 	if (hp_size != 0)
 	{
 		Size		hp_required;
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index f185ed28f95..9bb73f31052 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -815,7 +815,7 @@ pg_get_shmem_pagesize(void)
 	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
 
 	if (huge_pages_status == HUGE_PAGES_ON)
-		GetHugePageSize(&os_page_size, NULL);
+		GetHugePageSize(&os_page_size, NULL, NULL);
 
 	return os_page_size;
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..90d3feb547c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -140,6 +140,7 @@ int			max_parallel_maintenance_workers = 2;
  * register background workers.
  */
 int			NBuffers = 16384;
+int			MaxAvailableMemory = 524288;
 int			MaxConnections = 100;
 int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index b176d5130e4..cff8bb815f9 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1114,6 +1114,18 @@
   max => 'INT_MAX / 2',
 },
 
+# TODO: should this be PGC_POSTMASTER?
+{ name => "max_available_memory", type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_MEM',
+  short_desc => 'Sets the upper limit for the shared_buffers value.',
  long_desc => 'Shared memory can be resized at runtime; this parameter sets the upper limit for it, beyond which resizing is not supported. Normally this value would be the same as the total available memory.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'MaxAvailableMemory',
+  boot_val => '524288',
+  min => '16',
+  max => 'INT_MAX / 2',
+},
+
+
 { name => 'vacuum_buffer_usage_limit', type => 'int', context => 'PGC_USERSET', group => 'RESOURCES_MEM',
   short_desc => 'Sets the buffer pool size for VACUUM, ANALYZE, and autovacuum.',
   flags => 'GUC_UNIT_KB',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..a0c37a7749e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -173,6 +173,7 @@ extern PGDLLIMPORT char *DataDir;
 extern PGDLLIMPORT int data_directory_mode;
 
 extern PGDLLIMPORT int NBuffers;
+extern PGDLLIMPORT int MaxAvailableMemory;
 extern PGDLLIMPORT int MaxBackends;
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index ef9800732d9..40588ff6968 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 2348c59b5a0..79b0b1ef9eb 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
 extern PGDLLIMPORT int huge_pages_status;
+extern PGDLLIMPORT int MaxAvailableMemory;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
@@ -104,7 +105,9 @@ extern PGShmemHeader *PGSharedMemoryCreate(Size size,
 										   PGShmemHeader **shim);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
-extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
+extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
+							int *memfd_flags);
+void PrepareHugePages(void);
 
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
-- 
2.34.1

0007-Introduce-multiple-shmem-segments-for-share-20251013.patch (application/x-patch)
From 3a419573e0a5fe41e6aa1a2530c0c661d8a7eaa8 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 11:22:02 +0200
Subject: [PATCH 07/19] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into the following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers,
meaning that if we want to change NBuffers, these structures have to be
resized correspondingly. Placing each of them in a separate shmem
segment makes that possible.

There are some assumptions made about each shmem segment's upper size
limit. The buffer blocks get the largest one, while the rest claim less
extra room for resizing. Ideally those limits should be deduced from the
maximum allowed shared memory.
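
To make the ratios concrete: assuming max_available_memory is left at its
4 GB default, the reservations proposed below work out to roughly 2.4 GB for
buffer blocks (0.6), about 410 MB each for the main segment, buffer
descriptors and IO condition variables (0.1 each), and about 200 MB each for
checkpoint buffer ids and the strategy status (0.05 each), before rounding to
BLCKSZ or the huge page size.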
---
 src/backend/port/sysv_shmem.c          | 24 +++++++-
 src/backend/storage/buffer/buf_init.c  | 79 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  6 +-
 src/backend/storage/buffer/freelist.c  |  5 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 24 +++++++-
 7 files changed, 105 insertions(+), 37 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 363ddfd1fca..dac011b766b 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -139,10 +139,18 @@ static int next_free_segment = 0;
  *
  * The reserved space for each segment is calculated as a fraction of the total
  * reserved space (MaxAvailableMemory), as specified in the SHMEM_RESIZE_RATIO
- * array.
+ * array. E.g. we allow BUFFERS_SHMEM_SEGMENT to take up to 60% of the whole
+ * space when resizing, based on the fact that it most likely will be the main
+ * consumer of this memory. Those numbers are pulled out of thin air for now;
+ * it makes sense to evaluate them more precisely.
  */
-static double SHMEM_RESIZE_RATIO[1] = {
-	1.0, 									/* MAIN_SHMEM_SLOT */
+static double SHMEM_RESIZE_RATIO[6] = {
+	0.1,    /* MAIN_SHMEM_SEGMENT */
+	0.6,    /* BUFFERS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
+	0.1,    /* BUFFER_IOCV_SHMEM_SEGMENT */
+	0.05,   /* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
+	0.05,   /* STRATEGY_SHMEM_SEGMENT */
 };
 
 /*
@@ -167,6 +175,16 @@ MappingName(int shmem_segment)
 	{
 		case MAIN_SHMEM_SEGMENT:
 			return "main";
+		case BUFFERS_SHMEM_SEGMENT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SEGMENT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SEGMENT:
+			return "strategy";
 		default:
 			return "unknown";
 	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6fd3a6bbac5..5383442e213 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -62,7 +62,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). The size of the data structures
+ * initialized here depends on NBuffers, and to be able to change NBuffers
+ * without a restart we store each structure in a separate shared memory
+ * segment, which can be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -74,22 +77,22 @@ BufferManagerShmemInit(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSegment("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSegment("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -99,8 +102,9 @@ BufferManagerShmemInit(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -147,33 +151,54 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc., based on the given
+ * shared memory segment. The main segment must not allocate anything
+ * related to buffers; every other segment will receive its part of the
+ * data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(int shmem_segment)
 {
 	Size		size = 0;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == MAIN_SHMEM_SEGMENT)
+		return size;
 
-	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
+	{
+		/* size of buffer descriptors */
+		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+		/* to allow aligning buffer descriptors */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
 
-	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of data pages, plus alignment padding */
+		size = add_size(size, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	}
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
-								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
+	{
+		/* size of stuff controlled by freelist.c */
+		size = add_size(size, StrategyShmemSize());
+	}
 
-	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
+	{
+		/* size of I/O condition variables */
+		size = add_size(size, mul_size(NBuffers,
+									   sizeof(ConditionVariableMinimallyPadded)));
+		/* to allow aligning the above */
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+	}
+
+	if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
+	{
+		/* size of checkpoint sort array in bufmgr.c */
+		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	}
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 1f6e215a2ca..18a78967138 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -25,6 +25,7 @@
 #include "funcapi.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
+#include "storage/pg_shmem.h"
 #include "utils/rel.h"
 #include "utils/builtins.h"
 
@@ -64,10 +65,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION | HASH_FIXED_SIZE);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION | HASH_FIXED_SIZE,
+								  STRATEGY_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7fe34d3ef4c..299f6aa8e7e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 
 #define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
@@ -418,9 +419,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SEGMENT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b60f7ef9ce2..2dbd81afc87 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -113,7 +113,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize());
+	size = add_size(size, BufferManagerShmemSize(shmem_segment));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3f37b294af6..a222747b803 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -319,7 +319,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 79b0b1ef9eb..a7b275b4db9 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -109,7 +109,29 @@ extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 							int *memfd_flags);
 void PrepareHugePages(void);
 
+/*
+ * To be able to dynamically resize the largest parts of the data stored in
+ * shared memory, we split it into multiple shared memory segments. Each
+ * segment contains only a certain part of the data, whose size depends on
+ * NBuffers.
+ */
+
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SEGMENT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SEGMENT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SEGMENT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SEGMENT 5
+
 #endif							/* PG_SHMEM_H */
-- 
2.34.1

0010-WIP-Monitoring-views-20251013.patch (application/x-patch)
From a249feb0e7654865f53f3853c310a5cec58e185e Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Wed, 20 Aug 2025 10:55:27 +0530
Subject: [PATCH 10/19] WIP: Monitoring views

Modifies pg_shmem_allocations to report shared memory segment as well.

Adds pg_shmem_segments to report shared memory segment information.
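
With the patch applied, something along these lines should work for a quick
look at the segments and the per-segment allocations (the exact column set of
pg_shmem_segments is still WIP, so treat the output shape as illustrative):

    =# SELECT * FROM pg_shmem_segments;
    =# SELECT segment, sum(allocated_size) AS allocated
         FROM pg_shmem_allocations
        GROUP BY segment;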

TODO:
This commit should be merged with the earlier commit introducing
multiple shared memory segments.

Author: Ashutosh Bapat
---
 doc/src/sgml/system-views.sgml       |  9 +++
 src/backend/catalog/system_views.sql |  7 +++
 src/backend/storage/ipc/shmem.c      | 90 ++++++++++++++++++++++------
 src/include/catalog/pg_proc.dat      | 12 +++-
 src/include/storage/pg_shmem.h       |  1 -
 src/include/storage/shmem.h          |  1 +
 src/test/regress/expected/rules.out  | 10 +++-
 7 files changed, 108 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 8f3e2741051..bc70a3ee6c9 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -4233,6 +4233,15 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>segment</structfield> <type>text</type>
+      </para>
+      <para>
+       The name of the shared memory segment containing the allocation.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>off</structfield> <type>int8</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c7240250c07..94a2b5a9a67 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -668,6 +668,13 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
 REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
 GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
 
+CREATE VIEW pg_shmem_segments AS
+    SELECT * FROM pg_get_shmem_segments();
+
+REVOKE ALL ON pg_shmem_segments FROM PUBLIC;
+GRANT SELECT ON pg_shmem_segments TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_segments() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_segments() TO pg_read_all_stats;
 CREATE VIEW pg_shmem_allocations_numa AS
     SELECT * FROM pg_get_shmem_allocations_numa();
 
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 90c21a97225..9499f332e77 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -531,6 +531,7 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 		result->size = size;
 		result->allocated_size = allocated_size;
 		result->location = structPtr;
+		result->shmem_segment = shmem_segment;
 	}
 
 	LWLockRelease(ShmemIndexLock);
@@ -582,13 +583,14 @@ mul_size(Size s1, Size s2)
 Datum
 pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 {
-#define PG_GET_SHMEM_SIZES_COLS 4
+#define PG_GET_SHMEM_SIZES_COLS 5
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	HASH_SEQ_STATUS hstat;
 	ShmemIndexEnt *ent;
-	Size		named_allocated = 0;
+	Size		named_allocated[ANON_MAPPINGS] = {0};
 	Datum		values[PG_GET_SHMEM_SIZES_COLS];
 	bool		nulls[PG_GET_SHMEM_SIZES_COLS];
+	int			i;
 
 	InitMaterializedSRF(fcinfo, 0);
 
@@ -598,33 +600,42 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	/* output all allocated entries */
 	memset(nulls, 0, sizeof(nulls));
-	/* XXX: take all shared memory segments into account. */
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
-		values[2] = Int64GetDatum(ent->size);
-		values[3] = Int64GetDatum(ent->allocated_size);
-		named_allocated += ent->allocated_size;
+		values[1] = CStringGetTextDatum(MappingName(ent->shmem_segment));
+		values[2] = Int64GetDatum((char *) ent->location - (char *) Segments[ent->shmem_segment].ShmemSegHdr);
+		values[3] = Int64GetDatum(ent->size);
+		values[4] = Int64GetDatum(ent->allocated_size);
+		named_allocated[ent->shmem_segment] += ent->allocated_size;
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
 							 values, nulls);
 	}
 
 	/* output shared memory allocated but not counted via the shmem index */
-	values[0] = CStringGetTextDatum("<anonymous>");
-	nulls[1] = true;
-	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
-	values[3] = values[2];
-	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	for (i = 0; i < ANON_MAPPINGS; i++)
+	{
+		values[0] = CStringGetTextDatum("<anonymous>");
+		values[1] = CStringGetTextDatum(MappingName(i));
+		nulls[2] = true;
+		values[3] = Int64GetDatum(Segments[i].ShmemSegHdr->freeoffset - named_allocated[i]);
+		values[4] = values[3];
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
 
 	/* output as-of-yet unused shared memory */
-	nulls[0] = true;
-	values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
-	nulls[1] = false;
-	values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
-	values[3] = values[2];
-	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	memset(nulls, 0, sizeof(nulls));
+
+	for (i = 0; i < ANON_MAPPINGS; i++)
+	{
+		nulls[0] = true;
+		values[1] = CStringGetTextDatum(MappingName(i));
+		values[2] = Int64GetDatum(Segments[i].ShmemSegHdr->freeoffset);
+		values[3] = Int64GetDatum(Segments[i].ShmemSegHdr->totalsize - Segments[i].ShmemSegHdr->freeoffset);
+		values[4] = values[3];
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
 
 	LWLockRelease(ShmemIndexLock);
 
@@ -825,3 +836,46 @@ pg_numa_available(PG_FUNCTION_ARGS)
 {
 	PG_RETURN_BOOL(pg_numa_init() != -1);
 }
+
+/* SQL SRF showing shared memory segments */
+Datum
+pg_get_shmem_segments(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_SEGS_COLS 6
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum		values[PG_GET_SHMEM_SEGS_COLS];
+	bool		nulls[PG_GET_SHMEM_SEGS_COLS];
+	int i;
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	/* output all allocated entries */
+	for (i = 0; i < ANON_MAPPINGS; i++)
+	{
+		PGShmemHeader *shmhdr = Segments[i].ShmemSegHdr;
+		AnonymousMapping *segmapping = &Mappings[i];
+		int j;
+
+		if (shmhdr == NULL)
+		{
+			for (j = 0; j < PG_GET_SHMEM_SEGS_COLS; j++)
+				nulls[j] = true;
+		}
+		else
+		{
+			memset(nulls, 0, sizeof(nulls));
+			values[0] = Int32GetDatum(i);
+			values[1] = CStringGetTextDatum(MappingName(i));
+			values[2] = Int64GetDatum(shmhdr->totalsize);
+			values[3] = Int64GetDatum(shmhdr->freeoffset);
+			values[4] = Int64GetDatum(segmapping->shmem_size);
+			values[5] = Int64GetDatum(segmapping->shmem_reserved);
+		}
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+
+	return (Datum) 0;
+}
+
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e631323a325..8f1d0b7c031 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8576,8 +8576,8 @@
 { oid => '5052', descr => 'allocations from the main shared memory segment',
   proname => 'pg_get_shmem_allocations', prorows => '50', proretset => 't',
   provolatile => 'v', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,int8,int8,int8}', proargmodes => '{o,o,o,o}',
-  proargnames => '{name,off,size,allocated_size}',
+  proallargtypes => '{text,text,int8,int8,int8}', proargmodes => '{o,o,o,o,o}',
+  proargnames => '{name,segment,off,size,allocated_size}',
   prosrc => 'pg_get_shmem_allocations' },
 
 { oid => '4099', descr => 'Is NUMA support available?',
@@ -8600,6 +8600,14 @@
   proargmodes => '{o,o,o}', proargnames => '{name,type,size}',
   prosrc => 'pg_get_dsm_registry_allocations' },
 
+# shared memory segments 
+{ oid => '5101', descr => 'shared memory segments',
+  proname => 'pg_get_shmem_segments', prorows => '6', proretset => 't',
+  provolatile => 'v', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{int4,text,int8,int8,int8,int8}', proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{id,name,size,freeoffset,mapping_size,mapping_reserved_size}',
+  prosrc => 'pg_get_shmem_segments' },
+
 # buffer lookup table
 { oid => '5102',
   descr => 'shared buffer lookup table',
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index a1fa6b43fe3..715f6acb5dd 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -69,7 +69,6 @@ typedef struct ShmemSegment
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
 
-
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 910c43f54f4..64ff5a286ba 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -71,6 +71,7 @@ typedef struct
 	void	   *location;		/* location in shared mem */
 	Size		size;			/* # bytes requested for the structure */
 	Size		allocated_size; /* # bytes actually allocated */
+	int			shmem_segment;	/* segment in which the structure is allocated */
 } ShmemIndexEnt;
 
 #endif							/* SHMEM_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 83f566d3218..60c08081b69 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1772,14 +1772,22 @@ pg_shadow| SELECT pg_authid.rolname AS usename,
      LEFT JOIN pg_db_role_setting s ON (((pg_authid.oid = s.setrole) AND (s.setdatabase = (0)::oid))))
   WHERE pg_authid.rolcanlogin;
 pg_shmem_allocations| SELECT name,
+    segment,
     off,
     size,
     allocated_size
-   FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+   FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, segment, off, size, allocated_size);
 pg_shmem_allocations_numa| SELECT name,
     numa_node,
     size
    FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, numa_node, size);
+pg_shmem_segments| SELECT id,
+    name,
+    size,
+    freeoffset,
+    mapping_size,
+    mapping_reserved_size
+   FROM pg_get_shmem_segments() pg_get_shmem_segments(id, name, size, freeoffset, mapping_size, mapping_reserved_size);
 pg_stat_activity| SELECT s.datid,
     d.datname,
     s.pid,
-- 
2.34.1

0009-Refactor-CalculateShmemSize-20251013.patch
From 07377b5f6722dcfd60b91458ca03cef8a5230e4c Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 21 Aug 2025 11:56:09 +0530
Subject: [PATCH 09/19] Refactor CalculateShmemSize()

This function calls many functions which return the amount of shared
memory required for different shared memory data structures. Up until
now, the total of these sizes was used to create a single shared memory
segment. But starting with the previous patch, we create multiple shared
memory segments, each of which contains one shared memory structure
related to shared buffers, plus one main segment containing the rest of
the structures. Since CalculateShmemSize() is called for every shared
memory segment and its return value is added to the memory required for
every segment, we end up allocating more memory than required.

Instead, CalculateShmemSize() is called only once. Each of its callees
is expected to (a) return the size required from the main segment and
(b) add the sizes required from the other segments to the corresponding
AnonymousMappings.

For individual modules to add their memory requirements to the respective
AnonymousMappings, we need to know the different mappings upfront. Hence
ANON_MAPPINGS replaces next_free_segment.
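
As a rough sketch of that convention (simplified from the
BufferManagerShmemSize() change below, not the actual patch code), a callee
now looks roughly like this:

    Size
    ExampleShmemSize(void)    /* hypothetical callee of CalculateShmemSize() */
    {
        Size    size;

        /* requirement for a resizable, buffers-related segment */
        size = mul_size(NBuffers, sizeof(BufferDescPadded));
        size = add_size(size, PG_CACHE_LINE_SIZE);
        Mappings[BUFFER_DESCRIPTORS_SHMEM_SEGMENT].shmem_req_size = size;

        /* only the main-segment requirement is returned to the caller;
         * in this sketch nothing extra is needed from the main segment */
        return 0;
    }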

TODOs:

1. This change requires that the AnonymousMappings array and the macros
   defining the identifiers of each segment be platform-independent. This
   patch doesn't achieve that goal for all platforms, for example Windows.
   We need to fix that.

2. If postgres is invoked with -C shared_memory_size, it reports 0.
   That's because it reports the GUC value before the shared memory sizes
   are set in AnonymousMappings. Fix that too.

3. Eliminate this asymmetry in CalculateShmemSize(). See the TODO in the
   prologue of CalculateShmemSize().

4. This is one way to avoid requesting more memory than needed in each
   segment, but there may be other ways to design CalculateShmemSize().
   Need to think about it and implement it better.

Author: Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c         | 48 ++++++--------------
 src/backend/port/win32_shmem.c        |  7 +--
 src/backend/postmaster/postmaster.c   | 14 +++---
 src/backend/storage/buffer/buf_init.c | 55 ++++++++---------------
 src/backend/storage/ipc/ipci.c        | 65 ++++++++++++++++++++++-----
 src/backend/storage/ipc/shmem.c       |  8 ++--
 src/backend/tcop/postgres.c           | 14 +++---
 src/include/storage/bufmgr.h          |  2 +-
 src/include/storage/ipc.h             |  2 +-
 src/include/storage/pg_shmem.h        | 17 ++++++-
 10 files changed, 125 insertions(+), 107 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index dac011b766b..b85911bdfc4 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,21 +94,7 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-typedef struct AnonymousMapping
-{
-	int shmem_segment;
-	Size shmem_size; 			/* Size of the actually used memory */
-	Size shmem_reserved; 		/* Size of the reserved mapping */
-	Pointer shmem; 				/* Pointer to the start of the mapped memory */
-	Pointer seg_addr; 			/* SysV shared memory for the header */
-	unsigned long seg_id; 		/* IPC key */
-	int segment_fd; 			/* fd for the backing anon file */
-} AnonymousMapping;
-
-static AnonymousMapping Mappings[ANON_MAPPINGS];
-
-/* Keeps track of used mapping segments */
-static int next_free_segment = 0;
+AnonymousMapping Mappings[ANON_MAPPINGS];
 
 /*
  * Anonymous mapping layout we use looks like this:
@@ -168,7 +154,7 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
-static const char*
+const char*
 MappingName(int shmem_segment)
 {
 	switch (shmem_segment)
@@ -193,7 +179,7 @@ MappingName(int shmem_segment)
 static void
 DebugMappings()
 {
-	for(int i = 0; i < next_free_segment; i++)
+	for(int i = 0; i < ANON_MAPPINGS; i++)
 	{
 		AnonymousMapping m = Mappings[i];
 		elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
@@ -851,9 +837,6 @@ PrepareHugePages()
 {
 	void	   *ptr = MAP_FAILED;
 
-	/* Reset to handle reinitialization */
-	next_free_segment = 0;
-
 	/* Complain if hugepages demanded but we can't possibly support them */
 #if !defined(MAP_HUGETLB)
 	if (huge_pages == HUGE_PAGES_ON)
@@ -876,8 +859,7 @@ PrepareHugePages()
 		 */
 		for(int segment = 0; segment < ANON_MAPPINGS; segment++)
 		{
-			int	numSemas;
-			Size segment_size = CalculateShmemSize(&numSemas, segment);
+			Size segment_size = Mappings[segment].shmem_req_size;
 
 			if (segment_size % hugepagesize != 0)
 				segment_size += hugepagesize - (segment_size % hugepagesize);
@@ -909,7 +891,7 @@ PrepareHugePages()
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	for(int i = 0; i < next_free_segment; i++)
+	for(int i = 0; i < ANON_MAPPINGS; i++)
 	{
 		AnonymousMapping m = Mappings[i];
 
@@ -927,7 +909,7 @@ AnonymousShmemDetach(int status, Datum arg)
 /*
  * PGSharedMemoryCreate
  *
- * Create a shared memory segment of the given size and initialize its
+ * Create a shared memory segment for the given mapping and initialize its
  * standard header.  Also, register an on_shmem_exit callback to release
  * the storage.
  *
@@ -937,7 +919,7 @@ AnonymousShmemDetach(int status, Datum arg)
  * postmaster or backend.
  */
 PGShmemHeader *
-PGSharedMemoryCreate(Size size,
+PGSharedMemoryCreate(AnonymousMapping *mapping,
 					 PGShmemHeader **shim)
 {
 	IpcMemoryKey NextShmemSegID;
@@ -945,7 +927,6 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize, total_reserved;
-	AnonymousMapping *mapping = &Mappings[next_free_segment];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -965,13 +946,12 @@ PGSharedMemoryCreate(Size size,
 				 errmsg("huge pages not supported with the current \"shared_memory_type\" setting")));
 
 	/* Room for a header? */
-	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	Assert(mapping->shmem_req_size > MAXALIGN(sizeof(PGShmemHeader)));
 
 	/* Prepare the mapping information */
-	mapping->shmem_size = size;
-	mapping->shmem_segment = next_free_segment;
+	mapping->shmem_size = mapping->shmem_req_size;
 	total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
-	mapping->shmem_reserved = total_reserved * SHMEM_RESIZE_RATIO[next_free_segment];
+	mapping->shmem_reserved = total_reserved * SHMEM_RESIZE_RATIO[mapping->shmem_segment];
 
 	/* Round up to be a multiple of BLCKSZ */
 	mapping->shmem_reserved = mapping->shmem_reserved + BLCKSZ -
@@ -982,8 +962,6 @@ PGSharedMemoryCreate(Size size,
 		/* On success, mapping data will be modified. */
 		CreateAnonymousSegment(mapping);
 
-		next_free_segment++;
-
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
 
@@ -992,7 +970,7 @@ PGSharedMemoryCreate(Size size,
 	}
 	else
 	{
-		sysvsize = size;
+		sysvsize = mapping->shmem_req_size;
 
 		/* huge pages are only available with mmap */
 		SetConfigOption("huge_pages_status", "off",
@@ -1005,7 +983,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino + next_free_segment;
+	NextShmemSegID = statbuf.st_ino + mapping->shmem_segment;
 
 	for (;;)
 	{
@@ -1214,7 +1192,7 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	for(int i = 0; i < next_free_segment; i++)
+	for(int i = 0; i < ANON_MAPPINGS; i++)
 	{
 		AnonymousMapping m = Mappings[i];
 
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 732fedee87e..1db07ff65d3 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -204,7 +204,7 @@ EnableLockPagesPrivilege(int elevel)
  * standard header.
  */
 PGShmemHeader *
-PGSharedMemoryCreate(Size size,
+PGSharedMemoryCreate(AnonymousMapping *mapping,
 					 PGShmemHeader **shim)
 {
 	void	   *memAddress;
@@ -216,7 +216,7 @@ PGSharedMemoryCreate(Size size,
 	DWORD		size_high;
 	DWORD		size_low;
 	SIZE_T		largePageSize = 0;
-	Size		orig_size = size;
+	Size		size = mapping->shmem_req_size;
 	DWORD		flProtect = PAGE_READWRITE;
 	DWORD		desiredAccess;
 
@@ -304,7 +304,7 @@ retry:
 				 * Use the original size, not the rounded-up value, when
 				 * falling back to non-huge pages.
 				 */
-				size = orig_size;
+				size = mapping->shmem_req_size;
 				flProtect = PAGE_READWRITE;
 				goto retry;
 			}
@@ -391,6 +391,7 @@ retry:
 	hdr->totalsize = size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	hdr->dsm_control = 0;
+	mapping->shmem_size = size;
 
 	/* Save info for possible future use */
 	UsedShmemSegAddr = memAddress;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e1d643b013d..b59d20b4ac2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -963,13 +963,6 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	process_shmem_requests();
 
-	/*
-	 * Now that loadable modules have had their chance to request additional
-	 * shared memory, determine the value of any runtime-computed GUCs that
-	 * depend on the amount of shared memory required.
-	 */
-	InitializeShmemGUCs();
-
 	/*
 	 * Now that modules have been loaded, we can process any custom resource
 	 * managers specified in the wal_consistency_checking GUC.
@@ -1005,6 +998,13 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	CreateSharedMemoryAndSemaphores();
 
+	/*
+	 * Now that loadable modules have had their chance to request additional
+	 * shared memory, determine the value of any runtime-computed GUCs that
+	 * depend on the amount of shared memory required.
+	 */
+	InitializeShmemGUCs();
+
 	/*
 	 * Estimate number of openable files.  This must happen after setting up
 	 * semaphores, because on some platforms semaphores count as open files.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6d703e18f8b..6f148d1d80b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -158,48 +158,31 @@ BufferManagerShmemInit(void)
  * data.
  */
 Size
-BufferManagerShmemSize(int shmem_segment)
+BufferManagerShmemSize(void)
 {
-	Size		size = 0;
+	size_t size;
 
-	if (shmem_segment == MAIN_SHMEM_SEGMENT)
-		return size;
+	/* size of buffer descriptors, plus alignment padding */
+	size = add_size(0, mul_size(NBuffers, sizeof(BufferDescPadded)));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	Mappings[BUFFER_DESCRIPTORS_SHMEM_SEGMENT].shmem_req_size = size;
 
-	if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
-	{
-		/* size of buffer descriptors */
-		size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-		/* to allow aligning buffer descriptors */
-		size = add_size(size, PG_CACHE_LINE_SIZE);
-	}
+	/* size of data pages, plus alignment padding */
+	size = add_size(0, PG_IO_ALIGN_SIZE);
+	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	Mappings[BUFFERS_SHMEM_SEGMENT].shmem_req_size = size;
 
-	if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
-	{
-		/* size of data pages, plus alignment padding */
-		size = add_size(size, PG_IO_ALIGN_SIZE);
-		size = add_size(size, mul_size(NBuffers, BLCKSZ));
-	}
+	/* size of stuff controlled by freelist.c */
+	Mappings[STRATEGY_SHMEM_SEGMENT].shmem_req_size = StrategyShmemSize();
 
-	if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
-	{
-		/* size of stuff controlled by freelist.c */
-		size = add_size(size, StrategyShmemSize());
-	}
+	/* size of I/O condition variables, plus alignment padding */
+	size = add_size(0, mul_size(NBuffers,
+								   sizeof(ConditionVariableMinimallyPadded)));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	Mappings[BUFFER_IOCV_SHMEM_SEGMENT].shmem_req_size = size;
 
-	if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
-	{
-		/* size of I/O condition variables */
-		size = add_size(size, mul_size(NBuffers,
-									   sizeof(ConditionVariableMinimallyPadded)));
-		/* to allow aligning the above */
-		size = add_size(size, PG_CACHE_LINE_SIZE);
-	}
-
-	if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
-	{
-		/* size of checkpoint sort array in bufmgr.c */
-		size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
-	}
+	/* size of checkpoint sort array in bufmgr.c */
+	Mappings[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_req_size = mul_size(NBuffers, sizeof(CkptSortItem));
 
 	return size;
 }
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2dbd81afc87..2cd278449f0 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -84,9 +84,23 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ *
+ * TODO: Right now the callees of this function return the size of shared
+ * memory required in the main shared memory segment, but add the sizes
+ * required from other segments to the respective mappings. We should remove
+ * this asymmetry. It's only the buffer manager which adds sizes for other
+ * segments, but in the future there may be others. Further, the buffer
+ * manager's other segments are expected to hold only one resizable structure
+ * each, thus their size should be set only once when changing the shared
+ * buffer pool size (i.e. when changing the shared_buffers GUC). We shouldn't
+ * allow adding more structures to these segments, and thus should restrict
+ * adding sizes to the corresponding mappings after the initial size is set.
+ *
+ * TODO: Also we should do something about numSemas, which is not required
+ * everywhere CalculateShmemSize() is called.
  */
 Size
-CalculateShmemSize(int *num_semaphores, int shmem_segment)
+CalculateShmemSize(int *num_semaphores)
 {
 	Size		size;
 	int			numSemas;
@@ -113,7 +127,13 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize(shmem_segment));
+
+	/*
+	 * Buffer manager adds estimates for memory requirements for every shared
+	 * memory segment that it uses in the corresponding AnonymousMappings.
+	 * Consider size required from only the main shared memory segment here.
+	 */
+	size = add_size(size, BufferManagerShmemSize());
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
@@ -154,8 +174,15 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
+	/*
+	 * All the shared memory allocations considered so far happen in the main
+	 * shared memory segment.
+	 */
+	Mappings[MAIN_SHMEM_SEGMENT].shmem_req_size = size;
+
 	/* might as well round it off to a multiple of a typical page size */
-	size = add_size(size, 8192 - (size % 8192));
+	for (int segment = 0; segment < ANON_MAPPINGS; segment++)
+		Mappings[segment].shmem_req_size = add_size(Mappings[segment].shmem_req_size, 8192 - (Mappings[segment].shmem_req_size % 8192));
 
 	return size;
 }
@@ -201,26 +228,30 @@ CreateSharedMemoryAndSemaphores(void)
 {
 	PGShmemHeader *shim;
 	PGShmemHeader *seghdr;
-	Size		size;
 	int			numSemas;
 
 	Assert(!IsUnderPostmaster);
 
+	CalculateShmemSize(&numSemas);
+
 	/* Decide if we use huge pages or regular size pages */
 	PrepareHugePages();
 
 	for(int segment = 0; segment < ANON_MAPPINGS; segment++)
 	{
+		AnonymousMapping *mapping = &Mappings[segment];
+
+		mapping->shmem_segment = segment;
+
 		/* Compute the size of the shared-memory block */
-		size = CalculateShmemSize(&numSemas, segment);
-		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", mapping->shmem_req_size);
 
 		/*
 		 * Create the shmem segment.
 		 *
 		 * XXX: Do multiple shims are needed, one per segment?
 		 */
-		seghdr = PGSharedMemoryCreate(size, &shim);
+		seghdr = PGSharedMemoryCreate(mapping, &shim);
 
 		/*
 		 * Make sure that huge pages are never reported as "unknown" while the
@@ -232,9 +263,13 @@ CreateSharedMemoryAndSemaphores(void)
 		InitShmemAccessInSegment(seghdr, segment);
 
 		/*
-		 * Create semaphores
+		 * Shared memory for semaphores is allocated in the main shared memory.
+		 * Hence they are allocated after the main segment is created. Patch
+		 * proposed at https://commitfest.postgresql.org/patch/5997/ simplifies
+		 * this.
 		 */
-		PGReserveSemaphores(numSemas, segment);
+		if (segment == MAIN_SHMEM_SEGMENT)
+			PGReserveSemaphores(numSemas, segment);
 
 		/*
 		 * Set up shared memory allocation mechanism
@@ -357,7 +392,9 @@ CreateOrAttachShmemStructs(void)
  * InitializeShmemGUCs
  *
  * This function initializes runtime-computed GUCs related to the amount of
- * shared memory required for the current configuration.
+ * shared memory required for the current configuration. It assumes that the
+ * memory required by the shared memory segments is already calculated and is
+ * available in AnonymousMappings.
  */
 void
 InitializeShmemGUCs(void)
@@ -366,12 +403,16 @@ InitializeShmemGUCs(void)
 	Size		size_b;
 	Size		size_mb;
 	Size		hp_size;
-	int			num_semas;
+	int			num_semas = ProcGlobalSemas();
+	int		i;
 
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
+	size_b = 0;
+	for (i = 0; i < ANON_MAPPINGS; i++)
+		size_b = add_size(size_b, Mappings[i].shmem_req_size);
+
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index e6cb919f0fc..90c21a97225 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -178,8 +178,8 @@ ShmemAllocInSegment(Size size, int shmem_segment)
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("out of shared memory (%zu bytes requested)",
-						size)));
+				 errmsg("out of shared memory in segment %s (%zu bytes requested)",
+					MappingName(shmem_segment), size)));
 	return newSpace;
 }
 
@@ -286,8 +286,8 @@ ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("out of shared memory (%zu bytes requested)",
-						size)));
+				 errmsg("out of shared memory in segment %s (%zu bytes requested)",
+						MappingName(shmem_segment), size)));
 	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
 	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 193efeb9022..86ffe020c01 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4133,13 +4133,6 @@ PostgresSingleUserMain(int argc, char *argv[],
 	 */
 	process_shmem_requests();
 
-	/*
-	 * Now that loadable modules have had their chance to request additional
-	 * shared memory, determine the value of any runtime-computed GUCs that
-	 * depend on the amount of shared memory required.
-	 */
-	InitializeShmemGUCs();
-
 	/*
 	 * Now that modules have been loaded, we can process any custom resource
 	 * managers specified in the wal_consistency_checking GUC.
@@ -4152,6 +4145,13 @@ PostgresSingleUserMain(int argc, char *argv[],
 	 */
 	CreateSharedMemoryAndSemaphores();
 
+	/*
+	 * Now that loadable modules have had their chance to request additional
+	 * shared memory, determine the value of any runtime-computed GUCs that
+	 * depend on the amount of shared memory required.
+	 */
+	InitializeShmemGUCs();
+
 	/*
 	 * Estimate number of openable files.  This must happen after setting up
 	 * semaphores, because on some platforms semaphores count as open files.
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a222747b803..3f37b294af6 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -319,7 +319,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(int);
+extern Size BufferManagerShmemSize(void);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 6ebda479ced..3baf418b3d1 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
+extern Size CalculateShmemSize(int *num_semaphores);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index a7b275b4db9..a1fa6b43fe3 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -27,6 +27,18 @@
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
+typedef struct AnonymousMapping
+{
+	int shmem_segment;			/* TODO: Do we really need it? */
+	Size shmem_req_size;		/* Required size of the segment */
+	Size shmem_size; 			/* Size of the actually used memory */
+	Size shmem_reserved; 		/* Size of the reserved mapping */
+	Pointer shmem; 				/* Pointer to the start of the mapped memory */
+	Pointer seg_addr; 			/* SysV shared memory for the header */
+	unsigned long seg_id; 		/* IPC key */
+	int segment_fd; 			/* fd for the backing anon file */
+} AnonymousMapping;
+
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
 	int32		magic;			/* magic # to identify Postgres segments */
@@ -55,6 +67,8 @@ typedef struct ShmemSegment
 #define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
+
 
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
@@ -101,10 +115,11 @@ extern void PGSharedMemoryReAttach(void);
 extern void PGSharedMemoryNoReAttach(void);
 #endif
 
-extern PGShmemHeader *PGSharedMemoryCreate(Size size,
+extern PGShmemHeader *PGSharedMemoryCreate(AnonymousMapping *mapping,
 										   PGShmemHeader **shim);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
+extern const char *MappingName(int shmem_segment);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 							int *memfd_flags);
 void PrepareHugePages(void);
-- 
2.34.1

0012-Initial-value-of-shared_buffers-or-NBuffers-20251013.patch
From 132e1155b6b2ad1522087a11deb1029fa0dcdb4b Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 1 Sep 2025 15:40:41 +0530
Subject: [PATCH 12/19] Initial value of shared_buffers (or NBuffers)

The assign hook for shared_buffers (assign_shared_buffers()) is called twice
during server startup: the first time it sets the default value of
shared_buffers, the second time it sets the value specified in the
configuration file or on the command line. At those times the shared buffer
pool is yet to be initialized, hence there is no need to keep the GUC change
pending or to go through the entire process of resizing memory maps,
reinitializing shared memory and synchronizing processes. Instead the given
value should be assigned directly to NBuffers, which will be used when
creating the shared memory and when initializing the buffer pool for the
first time. Any change to shared_buffers after that requires remapping the
shared memory segments and synchronizing the buffer pool reinitialization
across backends.

If BufferBlocks is not initialized, assign_shared_buffers() assigns the given
value to NBuffers directly. Otherwise it marks the change as pending and
sets the flag pending_pm_shmem_resize so that the postmaster can start the
buffer pool reinitialization.

TODO:

1. The change depends on the C convention that global pointer variables are
   initialized to NULL. Maybe initialize BufferBlocks to NULL explicitly.

2. We might think of a better way to check whether the buffer pool has been
   initialized or not. See the comment in assign_shared_buffers().

Author: Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c | 42 ++++++++++++++++++++++++++---------
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index dc4eeeee56a..ba8613678f6 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1168,20 +1168,42 @@ ProcessBarrierShmemResize(Barrier *barrier)
 }
 
 /*
- * GUC assign hook for shared_buffers. It's recommended for an assign hook to
- * be as minimal as possible, thus we just request shared memory resize and
- * remember the previous value.
+ * GUC assign hook for shared_buffers.
+ *
+ * When the GUC is set for the first time after starting the server, the
+ * value is applied immediately, since no shared memory has been set up yet.
+ *
+ * After the shared memory is set up, changing the GUC value requires resizing
+ * and reinitializing (at least parts of) the shared memory structures related
+ * to shared buffers. That's a long and complicated process. It's recommended
+ * for an assign hook to be as minimal as possible, thus we just request a
+ * shared memory resize and remember the previous value.
  */
 void
 assign_shared_buffers(int newval, void *extra, bool *pending)
 {
-	elog(DEBUG1, "Received SIGHUP for shmem resizing");
-
-	pending_pm_shmem_resize = true;
-	*pending = true;
-	NBuffersPending = newval;
-
-	NBuffersOld = NBuffers;
+	/*
+	 * TODO: If a backend joins while the buffer resizing is in progress, or it
+	 * reads a value of shared_buffers from the configuration which differs from
+	 * the value being used by existing backends, this method may not work. Need
+	 * to think of a better solution.
+	 */
+	if (BufferBlocks)
+	{
+		elog(DEBUG1, "buffer pool is already initialized with size = %d, reinitializing it with size = %d",
+			NBuffers, newval);
+		pending_pm_shmem_resize = true;
+		*pending = true;
+		NBuffersPending = newval;
+		NBuffersOld = NBuffers;
+	}
+	else
+	{
+		elog(DEBUG1, "initializing buffer pool with size = %d", newval);
+		NBuffers = newval;
+		*pending = false;
+		pending_pm_shmem_resize = false;
+	}
 }
 
 /*
-- 
2.34.1

0013-Update-sizes-and-addresses-of-shared-memory-20251013.patch
From 9548894435757b4a542ba172490b757f46eae8fa Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 21 Aug 2025 15:44:24 +0530
Subject: [PATCH 13/19] Update sizes and addresses of shared memory mapping and
 shared memory structures

Update totalsize and end address in segment and mapping: once a shared
memory segment has been resized, its total size and end address need to
be updated in the corresponding AnonymousMapping and ShmemSegment
structures.

Update allocated_size for the resized shared memory structure: properly
reallocating the shared memory structure after resizing needs a bit more
work, but at least update allocated_size along with the size of the
shared memory structure.

Author: Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c   | 4 ++++
 src/backend/storage/ipc/shmem.c | 6 +++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index ba8613678f6..54d335b2e5d 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -1021,6 +1021,8 @@ AnonymousShmemResize(void)
 	for(int i = 0; i < ANON_MAPPINGS; i++)
 	{
 		AnonymousMapping *m = &Mappings[i];
+		ShmemSegment *segment = &Segments[i];
+		PGShmemHeader *shmem_hdr = segment->ShmemSegHdr;
 
 #ifdef MAP_HUGETLB
 		if (huge_pages_on && (m->shmem_req_size % hugepagesize != 0))
@@ -1067,6 +1069,8 @@ AnonymousShmemResize(void)
 
 		reinit = true;
 		m->shmem_size = m->shmem_req_size;
+		shmem_hdr->totalsize = m->shmem_size;
+		segment->ShmemEnd = m->shmem + m->shmem_size;
 	}
 
 	if (reinit)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 2a197540300..0f9abf69fd5 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -504,13 +504,17 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 		 *
 		 * XXX: There is an implicit assumption this can only happen in
 		 * "resizable" segments, where only one shared structure is allowed.
-		 * This has to be implemented more cleanly.
+		 * This has to be implemented more cleanly. Probably we should implement
+		 * ShmemReallocRawInSegment functionality just to adjust the size
+		 * according to alignment, return the allocated size and update the
+		 * mapping offset.
 		 */
 		if (result->size != size)
 		{
 			Size delta = size - result->size;
 
 			result->size = size;
+			result->allocated_size = size;
 
 			/* Reflect size change in the shared segment */
 			SpinLockAcquire(Segments[shmem_segment].ShmemLock);
-- 
2.34.1

0011-Allow-to-resize-shared-memory-without-resta-20251013.patch
From 46a7cabd8c0e8b92c5f7515856ad1d8f11a410a7 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 14:16:55 +0200
Subject: [PATCH 11/19] Allow to resize shared memory without restart

Add an assign hook for shared_buffers to resize shared memory, using the
space introduced in the previous commits, without requiring a PostgreSQL
restart. Essentially the implementation is based on two mechanisms: a
ProcSignalBarrier is used to make sure all processes start the resize
procedure simultaneously, and a global Barrier is used to coordinate
after that and make sure all processes that have finished wait for the
others that are still in progress.

The resize process looks like this:

* The GUC assign hook sets a flag to let the postmaster know that a resize
  was requested.

* The postmaster checks the flag in its event loop and starts the resize
  by emitting a ProcSignal barrier.

* All processes that participate in the ProcSignal mechanism begin to
  process the ProcSignal barrier. First, each process waits until all
  processes have confirmed they received the message, so that everyone
  can start simultaneously.

* Every process recalculates the shared memory size based on the new
  NBuffers, adjusts the segment sizes using ftruncate and adjusts the
  reservation permissions with mprotect. One elected process signals the
  postmaster to do the same.

* When finished, every process waits on a global ShmemControl barrier,
  until all others are finished as well. This way we ensure three
  stages with clear boundaries: before the resize, when all processes
  use the old NBuffers; during the resize, when processes have a mix of
  old and new NBuffers and wait until it's done; after the resize, when
  all processes use the new NBuffers.

* After all processes are using the new value, one of them initializes
  the new shared structures (buffer blocks, descriptors, etc.) as needed
  and broadcasts the new value of NBuffers via ShmemControl in shared
  memory. The other backends wait for this operation to finish as well.
  Then the barrier is lifted and everything proceeds as usual.

Since resizing takes time, we need to take into account that during that time:

- New backends can be spawned. They will check the status of the barrier
  early during bootstrap, and wait until everything is over before working
  with the new NBuffers value.

- Old backends can exit before attempting to resize. The synchronization
  between backends relies on the ProcSignalBarrier and waits at the
  beginning for all participants that received the message, which gathers
  all existing backends.

- Some backends might be blocked and not responding, either before or
  after receiving the message. In the first case such a backend still has
  a ProcSignalSlot and will be waited for; in the second case the shared
  barrier makes sure we still wait for those backends. In either case
  there is an unbounded wait.

- Backends might join the barrier in disjoint groups with some time in
  between. That means that relying only on the shared dynamic barrier is
  not enough -- it will only synchronize the resize procedure within those
  groups. That's why we first wait for all participants of the ProcSignal
  mechanism who received the message.

Here is how it looks after raising shared_buffers from 128 MB to
512 MB and calling pg_reload_conf():

    -- 128 MB
    7f87909fc000-7f8798248000 rw-s /memfd:strategy (deleted)
    7f8798248000-7f879d6ca000 ---s /memfd:strategy (deleted)
    7f879d6ca000-7f87a4e84000 rw-s /memfd:checkpoint (deleted)
    7f87a4e84000-7f87aa398000 ---s /memfd:checkpoint (deleted)
    7f87aa398000-7f87b1b42000 rw-s /memfd:iocv (deleted)
    7f87b1b42000-7f87c3d32000 ---s /memfd:iocv (deleted)
    7f87c3d32000-7f87cb59c000 rw-s /memfd:descriptors (deleted)
    7f87cb59c000-7f87dd6cc000 ---s /memfd:descriptors (deleted)
    7f87dd6cc000-7f87ece38000 rw-s /memfd:buffers (deleted)
    ^ buffers content, ~247 MB
    7f87ece38000-7f8877066000 ---s /memfd:buffers (deleted)
    ^ reserved space, ~2210 MB
    7f8877066000-7f887e7d0000 rw-s /memfd:main (deleted)
    7f887e7d0000-7f8890a00000 ---s /memfd:main (deleted)

    -- 512 MB
    7f87909fc000-7f879866a000 rw-s /memfd:strategy (deleted)
    7f879866a000-7f879d6ca000 ---s /memfd:strategy (deleted)
    7f879d6ca000-7f87a50f4000 rw-s /memfd:checkpoint (deleted)
    7f87a50f4000-7f87aa398000 ---s /memfd:checkpoint (deleted)
    7f87aa398000-7f87b1d82000 rw-s /memfd:iocv (deleted)
    7f87b1d82000-7f87c3d32000 ---s /memfd:iocv (deleted)
    7f87c3d32000-7f87cba1c000 rw-s /memfd:descriptors (deleted)
    7f87cba1c000-7f87dd6cc000 ---s /memfd:descriptors (deleted)
    7f87dd6cc000-7f8804fb8000 rw-s /memfd:buffers (deleted)
    ^ buffers content, ~632 MB
    7f8804fb8000-7f8877066000 ---s /memfd:buffers (deleted)
    ^ reserved space, ~1824 MB
    7f8877066000-7f887e950000 rw-s /memfd:main (deleted)
    7f887e950000-7f8890a00000 ---s /memfd:main (deleted)

The implementation supports only increasing shared_buffers. Decreasing
the value needs a similar procedure, but the buffer blocks containing
data have to be drained first, so that the actual data set fits into the
new, smaller space.

From experiments it turns out that shared mappings have to be extended
separately for each process that uses them. Another rough edge is that a
backend blocked on ReadCommand will not apply the shared_buffers change
until it receives something.
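
For reference, one minimal way to trigger the resize at runtime (assuming the
configuration change is applied via pg_reload_conf(), and that the monitoring
view from the earlier patch in this series is available) is:

    =# ALTER SYSTEM SET shared_buffers = '512MB';
    =# SELECT pg_reload_conf();
    -- once the resize has finished, the new sizes are visible e.g. via
    =# SELECT name, size, mapping_size, mapping_reserved_size
       FROM pg_shmem_segments;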

Authors: Dmitrii Dolgov, Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 443 ++++++++++++++++++
 src/backend/postmaster/checkpointer.c         |  12 +-
 src/backend/postmaster/postmaster.c           |  18 +
 src/backend/storage/buffer/buf_init.c         |  60 ++-
 src/backend/storage/ipc/ipci.c                |  15 +-
 src/backend/storage/ipc/procsignal.c          |  46 ++
 src/backend/storage/ipc/shmem.c               |  23 +-
 src/backend/tcop/postgres.c                   |  10 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/backend/utils/misc/guc_parameters.dat     |   3 +-
 src/include/storage/bufmgr.h                  |   2 +-
 src/include/storage/ipc.h                     |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pg_shmem.h                |  26 +
 src/include/storage/pmsignal.h                |   3 +-
 src/include/storage/procsignal.h              |   1 +
 src/tools/pgindent/typedefs.list              |   1 +
 17 files changed, 632 insertions(+), 38 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index b85911bdfc4..dc4eeeee56a 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,13 +30,19 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/pmsignal.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/pidfile.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -96,6 +102,13 @@ void	   *UsedShmemSegAddr = NULL;
 
 AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/* Flag telling postmaster that resize is needed */
+volatile bool pending_pm_shmem_resize = false;
+
+/* Keeps track of the previous NBuffers value */
+static int NBuffersOld = -1;
+static int NBuffersPending = -1;
+
 /*
  * Anonymous mapping layout we use looks like this:
  *
@@ -147,6 +160,49 @@ static double SHMEM_RESIZE_RATIO[6] = {
  */
 static bool huge_pages_on = false;
 
+/*
+ * Flag telling that we have prepared the memory layout to be resizable. If
+ * false after all shared memory segments creation, it means we failed to setup
+ * needed layout and falled back to the regular non-resizable approach.
+ */
+static bool shmem_resizable = false;
+
+/*
+ * Currently broadcasted value of NBuffers in shared memory.
+ *
+ * Most of the time this value is going to be equal to NBuffers. But if the
+ * postmaster is resizing shared memory and a new backend is created at the
+ * same time, the new backend may inherit the old NBuffers value but miss the
+ * resize signal if its ProcSignal infrastructure was not initialized yet.
+ * Consider this situation:
+ *
+ *     Postmaster ------> New Backend
+ *         |                   |
+ *         |                Launch
+ *         |                   |
+ *         |             Inherit NBuffers
+ *         |                   |
+ *     Resize NBuffers         |
+ *         |                   |
+ *     Emit Barrier            |
+ *         |            Init ProcSignal
+ *         |                   |
+ *     Finish resize           |
+ *         |                   |
+ *     New NBuffers       Old NBuffers
+ *
+ * In this case the backend is not yet ready to receive a signal from
+ * EmitProcSignalBarrier, and will be ignored. The same happens if ProcSignal
+ * is initialized even later, after the resizing was finished.
+ *
+ * To address the resulting inconsistency, the postmaster broadcasts the
+ * current NBuffers value via shared memory. Every new backend has to verify
+ * this value before it accesses the buffer pool: if it differs from its own
+ * value, a shared memory resize has happened and the backend has to first
+ * synchronize with the rest of the pack.
+ */
+ShmemControl *ShmemCtrl = NULL;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -906,6 +962,346 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * Resize all shared memory segments based on the current NBuffers value,
+ * which is applied from NBuffersPending. The actual segment resizing is done
+ * via ftruncate, which will fail if there is not sufficient space to expand
+ * the anon file. When finished, initialize any new buffer blocks based on the
+ * new and old values.
+ *
+ * If reinitialization took place, as the last step this function also
+ * reinitializes the buffers and broadcasts the new value of NSharedBuffers.
+ * All of that needs to be done only by one backend, the first one that
+ * managed to grab the ShmemResizeLock.
+ */
+bool
+AnonymousShmemResize(void)
+{
+	int		numSemas;
+	bool 	reinit = false;
+	int		mmap_flags = PG_MMAP_FLAGS;
+	Size 	hugepagesize;
+
+	NBuffers = NBuffersPending;
+
+	elog(DEBUG1, "Resize shmem from %d to %d", NBuffersOld, NBuffers);
+
+	/*
+	 * XXX: Where to reset the flag is still an open question. E.g. do we
+	 * consider a no-op when NBuffers is equal to NBuffersOld a genuine resize
+	 * and reset the flag?
+	 */
+	pending_pm_shmem_resize = false;
+
+	/*
+	 * XXX: Currently only increasing of shared_buffers is supported. For
+	 * decreasing something similar has to be done, but buffer blocks with
+	 * data have to be drained first.
+	 */
+	if(NBuffersOld > NBuffers)
+		return false;
+
+#ifndef MAP_HUGETLB
+	/* PrepareHugePages should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
+#else
+	if (huge_pages_on)
+	{
+		/* Make sure nothing is messed up */
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+
+		/* Round up the new size to a suitable large value */
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+	}
+#endif
+
+	/* Note that CalculateShmemSize indirectly depends on NBuffers */
+	CalculateShmemSize(&numSemas);
+
+	for(int i = 0; i < ANON_MAPPINGS; i++)
+	{
+		AnonymousMapping *m = &Mappings[i];
+
+#ifdef MAP_HUGETLB
+		if (huge_pages_on && (m->shmem_req_size % hugepagesize != 0))
+			m->shmem_req_size += hugepagesize - (m->shmem_req_size % hugepagesize);
+#endif
+
+		if (m->shmem == NULL)
+			continue;
+
+		if (m->shmem_size == m->shmem_req_size)
+			continue;
+
+		if (m->shmem_reserved < m->shmem_req_size)
+			ereport(ERROR,
+					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					 errmsg("not enough shared memory is reserved"),
+					 errhint("You may need to increase \"max_available_memory\".")));
+
+		elog(DEBUG1, "segment[%s]: resize from %zu to %zu at address %p",
+					 MappingName(m->shmem_segment), m->shmem_size,
+					 m->shmem_req_size, m->shmem);
+
+		/* Resize the backing anon file. */
+		if(ftruncate(m->segment_fd, m->shmem_req_size) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not truncate anonymous file for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
+		/* Adjust memory accessibility */
+		if(mprotect(m->shmem, m->shmem_req_size, PROT_READ | PROT_WRITE) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not mprotect anonymous shared memory for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
+		/* If shrinking, make reserved space unavailable again */
+		if(m->shmem_req_size < m->shmem_size &&
+		   mprotect(m->shmem + m->shmem_req_size, m->shmem_size - m->shmem_req_size, PROT_NONE) == -1)
+			ereport(FATAL,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not mprotect reserved shared memory for \"%s\": %m",
+							MappingName(m->shmem_segment))));
+
+		reinit = true;
+		m->shmem_size = m->shmem_req_size;
+	}
+
+	if (reinit)
+	{
+		if(IsUnderPostmaster &&
+			LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+		{
+			/*
+			 * If the new NBuffers was already broadcasted, the buffer pool was
+			 * already initialized before.
+			 *
+			 * Since we're not on a hot path, we use lwlocks and do not need to
+			 * involve memory barrier.
+			 */
+			if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+			{
+				/*
+				 * Allow the first backend that managed to get the lock to
+				 * reinitialize the new portion of buffer pool. Every other
+				 * process will wait on the shared barrier for that to finish,
+				 * since it's a part of the SHMEM_RESIZE_DONE phase.
+				 *
+				 * Note that it's enough when only one backend will do that,
+				 * even the ShmemInitStruct part. The reason is that resized
+				 * shared memory will maintain the same addresses, meaning that
+				 * all the pointers are still valid, and we only need to update
+				 * structures size in the ShmemIndex once -- any other backend
+				 * will pick up this shared structure from the index.
+				 *
+				 * XXX: This is the right place for buffer eviction as well.
+				 */
+				BufferManagerShmemInit(NBuffersOld);
+
+				/* If all fine, broadcast the new value */
+				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+			}
+
+			LWLockRelease(ShmemResizeLock);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * We are asked to resize shared memory. Wait for all ProcSignal participants
+ * to join the barrier, then do the resize and wait on the barrier until all
+ * participating finish resizing as well -- otherwise we face danger of
+ * inconsistency between backends.
+ *
+ * XXX: If a backend is blocked on ReadCommand in PostgresMain, it will not
+ * proceed with AnonymousShmemResize after receiving SIGHUP, until something
+ * will be sent.
+ */
+bool
+ProcessBarrierShmemResize(Barrier *barrier)
+{
+	Assert(IsUnderPostmaster);
+
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+
+	/* Wait until we have seen the new NBuffers value */
+	if (!pending_pm_shmem_resize)
+		return false;
+
+	/*
+	 * First thing to do after attaching to the barrier is to wait for others.
+	 * We can't simply use BarrierArriveAndWait, because backends might arrive
+	 * here in disjoint groups, e.g. first two backends, a pause, then two more
+	 * backends. If the resize is quick enough, that can lead to a situation
+	 * where the first group has already finished before the second has
+	 * appeared, and the barrier will only synchronize within those groups.
+	 */
+	if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
+		WaitForProcSignalBarrierReceived(
+				pg_atomic_read_u64(&ShmemCtrl->Generation));
+
+	/*
+	 * Now start the procedure, and elect one backend to ping postmaster to do
+	 * the same.
+	 *
+	 * XXX: If we need to be able to abort resizing, this has to be done later,
+	 * after the SHMEM_RESIZE_DONE.
+	 */
+	if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+	{
+		Assert(IsUnderPostmaster);
+		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
+	}
+
+	AnonymousShmemResize();
+
+	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
+	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
+
+	BarrierDetach(barrier);
+	return true;
+}
+
+/*
+ * GUC assign hook for shared_buffers. It's recommended for an assign hook to
+ * be as minimal as possible, thus we just request shared memory resize and
+ * remember the previous value.
+ */
+void
+assign_shared_buffers(int newval, void *extra, bool *pending)
+{
+	elog(DEBUG1, "Received SIGHUP for shmem resizing");
+
+	pending_pm_shmem_resize = true;
+	*pending = true;
+	NBuffersPending = newval;
+
+	NBuffersOld = NBuffers;
+}
+
+/*
+ * Test if we have somehow missed a shmem resize signal and the NBuffers value
+ * differs from NSharedBuffers. If yes, catch up and do the resize.
+ */
+void
+AdjustShmemSize(void)
+{
+	uint32 NSharedBuffers = pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);
+
+	if (NSharedBuffers != NBuffers)
+	{
+		/*
+		 * If the broadcasted shared_buffers is different from the one we see,
+		 * it could be that the backend has missed a resize signal. To avoid
+		 * any inconsistency, adjust the shared mappings, before having a
+		 * chance to access the buffer pool.
+		 */
+		ereport(LOG,
+				(errmsg("shared_buffers has been changed from %d to %d, "
+						"resizing shared memory",
+						NBuffers, NSharedBuffers)));
+		NBuffers = NSharedBuffers;
+		AnonymousShmemResize();
+	}
+}
+
+/*
+ * Start resizing procedure, making sure all existing processes will have
+ * consistent view of shared memory size. Must be called only in postmaster.
+ */
+void
+CoordinateShmemResize(void)
+{
+	elog(DEBUG1, "Coordinating shmem resize from %d to %d",
+		 NBuffersOld, NBuffers);
+	Assert(!IsUnderPostmaster);
+
+	/*
+	 * We use dynamic barrier to help dealing with backends that were spawned
+	 * during the resize.
+	 */
+	BarrierInit(&ShmemCtrl->Barrier, 0);
+
+	/*
+	 * If the value did not change, or shared memory segments are not
+	 * initialized yet, skip the resize.
+	 */
+	if (NBuffersPending == NBuffersOld)
+	{
+		elog(DEBUG1, "Skip resizing, new %d, old %d",
+			 NBuffers, NBuffersOld);
+		return;
+	}
+
+	/*
+	 * Shared memory resize requires some coordination done by postmaster,
+	 * and consists of three phases:
+	 *
+	 * - Before the resize all existing backends have the same old NBuffers.
+	 * - When the resize is in progress, backends are expected to have a
+	 *   mixture of old and new values. They're not allowed to touch the
+	 *   buffer pool during this time frame.
+	 * - After the resize has finished, all existing backends that can access
+	 *   the buffer pool are expected to have the same new value of NBuffers.
+	 *
+	 * Those phases are ensured by joining the shared barrier associated with
+	 * the procedure. Since resizing takes time, we need to take into account
+	 * that during that time:
+	 *
+	 * - New backends can be spawned. They will check the status of the barrier
+	 *   early during bootstrap, and wait until everything is over before
+	 *   working with the new NBuffers value.
+	 *
+	 * - Old backends can exit before attempting to resize. The synchronization
+	 *   between backends relies on the ProcSignalBarrier and waits at the
+	 *   beginning for all participants that received the message, which
+	 *   gathers all existing backends.
+	 *
+	 * - Some backends might be blocked and not responding, either before or
+	 *   after receiving the message. In the first case such a backend still
+	 *   has a ProcSignalSlot and will be waited for; in the second case the
+	 *   shared barrier makes sure we still wait for those backends. In either
+	 *   case there is an unbounded wait.
+	 *
+	 * - Backends might join the barrier in disjoint groups with some time in
+	 *   between. That means that relying only on the shared dynamic barrier is
+	 *   not enough -- it will only synchronize the resize procedure within
+	 *   those groups. That's why we first wait for all participants of the
+	 *   ProcSignal mechanism who received the message.
+	 */
+	elog(DEBUG1, "Emit a barrier for shmem resizing");
+	pg_atomic_init_u64(&ShmemCtrl->Generation,
+					   EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE));
+
+	/* To order everything after setting Generation value */
+	pg_memory_barrier();
+
+	/*
+	 * After that the postmaster waits for PMSIGNAL_SHMEM_RESIZE as a sign
+	 * that the rest of the pack has started the procedure and it can resize
+	 * shared memory as well.
+	 *
+	 * Normally we would call WaitForProcSignalBarrier here to wait until every
+	 * backend has reported on the ProcSignalBarrier. But for a shared memory
+	 * resize we don't need this, as every participating backend will
+	 * synchronize on the ProcSignal barrier. In fact, even if we wanted to
+	 * wait here, it wouldn't be possible -- we're in the postmaster, without
+	 * any waiting infrastructure available.
+	 *
+	 * If at some point it turns out that waiting is essential, we would need
+	 * to consider alternatives. E.g. it could be a designated coordination
+	 * process, which is not the postmaster. Another option would be
+	 * to introduce a CoordinateShmemResize lock and allow only one process to
+	 * take it (this probably would have to be something different than
+	 * LWLocks, since they block interrupts, and coordination relies on them).
+	 */
+}
+
 /*
  * PGSharedMemoryCreate
  *
@@ -1217,3 +1613,50 @@ PGSharedMemoryDetach(void)
 		}
 	}
 }
+
+void
+WaitOnShmemBarrier(void)
+{
+	Barrier *barrier = &ShmemCtrl->Barrier;
+
+	/* Nothing to do if resizing is not started */
+	if (BarrierPhase(barrier) < SHMEM_RESIZE_START)
+		return;
+
+	BarrierAttach(barrier);
+
+	/* Otherwise wait through all available phases */
+	while (BarrierPhase(barrier) < SHMEM_RESIZE_DONE)
+	{
+		ereport(LOG, (errmsg("shared memory resize barrier is in phase %d, waiting",
+							 BarrierPhase(barrier))));
+
+		BarrierArriveAndWait(barrier, 0);
+	}
+
+	BarrierDetach(barrier);
+}
+
+void
+ShmemControlInit(void)
+{
+	bool foundShmemCtrl;
+
+	ShmemCtrl = (ShmemControl *)
+		ShmemInitStruct("Shmem Control", sizeof(ShmemControl),
+						&foundShmemCtrl);
+
+	if (!foundShmemCtrl)
+	{
+		/*
+		 * The barrier is not initialized here; it will be initialized right
+		 * before starting the resizing process, as a convenient way to reset it.
+		 */
+
+		/* Initialize with the currently known value */
+		pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+
+		/* shmem_resizable should be initialized by now */
+		ShmemCtrl->Resizable = shmem_resizable;
+	}
+}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e84e8663e96..ef3f84a55f5 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -654,9 +654,12 @@ CheckpointerMain(const void *startup_data, size_t startup_data_len)
 static void
 ProcessCheckpointerInterrupts(void)
 {
-	if (ProcSignalBarrierPending)
-		ProcessProcSignalBarrier();
-
+	/*
+	 * Reloading the config can trigger further signals, complicating interrupt
+	 * processing -- so let it run first.
+	 *
+	 * XXX: Is there any need for a memory barrier after ProcessConfigFile?
+	 */
 	if (ConfigReloadPending)
 	{
 		ConfigReloadPending = false;
@@ -676,6 +679,9 @@ ProcessCheckpointerInterrupts(void)
 		UpdateSharedMemoryConfig();
 	}
 
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
 		ProcessLogMemoryContextInterrupt();
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b59d20b4ac2..ba9528d5dfa 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -426,6 +426,7 @@ static void process_pm_pmsignal(void);
 static void process_pm_child_exit(void);
 static void process_pm_reload_request(void);
 static void process_pm_shutdown_request(void);
+static void process_pm_shmem_resize(void);
 static void dummy_handler(SIGNAL_ARGS);
 static void CleanupBackend(PMChild *bp, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
@@ -1697,6 +1698,9 @@ ServerLoop(void)
 			if (pending_pm_pmsignal)
 				process_pm_pmsignal();
 
+			if (pending_pm_shmem_resize)
+				process_pm_shmem_resize();
+
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
 				ClientSocket s;
@@ -2042,6 +2046,17 @@ process_pm_reload_request(void)
 	}
 }
 
+static void
+process_pm_shmem_resize(void)
+{
+	/*
+	 * Failure to resize is considered fatal and will not be retried,
+	 * which means we can clear the pending flag right here.
+	 */
+	pending_pm_shmem_resize = false;
+	CoordinateShmemResize();
+}
+
 /*
  * pg_ctl uses SIGTERM, SIGINT and SIGQUIT to request different types of
  * shutdown.
@@ -3862,6 +3877,9 @@ process_pm_pmsignal(void)
 		request_state_update = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_SHMEM_RESIZE))
+		AnonymousShmemResize();
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
 	 */
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6f148d1d80b..0e72e373193 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -18,6 +18,7 @@
 #include "storage/buf_internals.h"
 #include "storage/pg_shmem.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -63,18 +64,28 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend). Size of data structures initialized
- * here depends on NBuffers, and to be able to change NBuffers without a
- * restart we store each structure into a separate shared memory segment, which
- * could be resized on demand.
+ * postmaster, or in a standalone backend) or during shared-memory resize. Size
+ * of data structures initialized here depends on NBuffers, and to be able to
+ * change NBuffers without a restart we store each structure into a separate
+ * shared memory segment, which could be resized on demand.
+ *
+ * FirstBufferToInit tells where to start initializing buffers. For the
+ * initial setup it will always be zero, but when resizing shared memory it
+ * indicates the number of already initialized buffers.
+ *
+ * No locks are taken in this function; it is the caller's responsibility to
+ * make sure only one backend works with the new buffers.
  */
 void
-BufferManagerShmemInit(void)
+BufferManagerShmemInit(int FirstBufferToInit)
 {
 	bool		foundBufs,
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
+	int			i;
+	elog(DEBUG1, "BufferManagerShmemInit from %d to %d",
+				 FirstBufferToInit, NBuffers);
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
@@ -111,34 +122,35 @@ BufferManagerShmemInit(void)
 	{
 		/* should find all of these, or none of them */
 		Assert(foundDescs && foundBufs && foundIOCV && foundBufCkpt);
-		/* note: this path is only taken in EXEC_BACKEND case */
-	}
-	else
-	{
-		int			i;
-
 		/*
-		 * Initialize all the buffer headers.
+		 * note: this path is only taken in EXEC_BACKEND case when initializing
+		 * shared memory, or in all cases when resizing shared memory.
 		 */
-		for (i = 0; i < NBuffers; i++)
-		{
-			BufferDesc *buf = GetBufferDescriptor(i);
+	}
+
+#ifndef EXEC_BACKEND
+	/*
+	 * Initialize all the buffer headers.
+	 */
+	for (i = FirstBufferToInit; i < NBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
 
-			ClearBufferTag(&buf->tag);
+		ClearBufferTag(&buf->tag);
 
-			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
 
-			buf->buf_id = i;
+		buf->buf_id = i;
 
-			pgaio_wref_clear(&buf->io_wref);
+		pgaio_wref_clear(&buf->io_wref);
 
-			LWLockInitialize(BufferDescriptorGetContentLock(buf),
-							 LWTRANCHE_BUFFER_CONTENT);
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
 
-			ConditionVariableInit(BufferDescriptorGetIOCV(buf));
-		}
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
 	}
+#endif
 
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2cd278449f0..bd75f06047e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -171,6 +171,14 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how the slots are split, or was there a hidden
+	 * inconsistency in the shmem calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
@@ -333,7 +341,7 @@ CreateOrAttachShmemStructs(void)
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
-	BufferManagerShmemInit();
+	BufferManagerShmemInit(0);
 
 	/*
 	 * Set up lock manager
@@ -345,6 +353,11 @@ CreateOrAttachShmemStructs(void)
 	 */
 	PredicateLockShmemInit();
 
+	/*
+	 * Set up shared memory resize manager
+	 */
+	ShmemControlInit();
+
 	/*
 	 * Set up process table
 	 */
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index eb3ceaae809..2160d258fa7 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/pg_shmem.h"
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
@@ -113,6 +114,10 @@ static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
 
+#ifdef DEBUG_SHMEM_RESIZE
+bool delay_proc_signal_init = false;
+#endif
+
 /*
  * ProcSignalShmemSize
  *		Compute space needed for ProcSignal's shared memory
@@ -176,6 +181,43 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	uint32		old_pss_pid;
 
 	Assert(cancel_key_len >= 0 && cancel_key_len <= MAX_CANCEL_KEY_LENGTH);
+
+#ifdef DEBUG_SHMEM_RESIZE
+	/*
+	 * Introduced for debugging purposes. You can change the variable at
+	 * runtime using gdb, then start new backends with delayed ProcSignal
+	 * initialization. A simple pg_usleep won't work here because the SIGHUP
+	 * interrupt is needed for testing. Taken from pg_sleep.
+	 */
+	if (delay_proc_signal_init)
+	{
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+		float8		endtime = GetNowFloat() + 5;
+
+		for (;;)
+		{
+			float8		delay;
+			long		delay_ms;
+
+			CHECK_FOR_INTERRUPTS();
+
+			delay = endtime - GetNowFloat();
+			if (delay >= 600.0)
+				delay_ms = 600000;
+			else if (delay > 0.0)
+				delay_ms = (long) (delay * 1000.0);
+			else
+				break;
+
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 delay_ms,
+							 WAIT_EVENT_PG_SLEEP);
+			ResetLatch(MyLatch);
+		}
+	}
+#endif
+
 	if (MyProcNumber < 0)
 		elog(ERROR, "MyProcNumber not set");
 	if (MyProcNumber >= NumProcSignalSlots)
@@ -615,6 +657,10 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
+					case PROCSIGNAL_BARRIER_SHMEM_RESIZE:
+						processed = ProcessBarrierShmemResize(
+								&ShmemCtrl->Barrier);
+						break;
 				}
 
 				/*
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 9499f332e77..2a197540300 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -498,17 +498,26 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. Verify the structure's size:
+		 * - If it's the same, we've found the expected structure.
+		 * - If it's different, we're resizing the expected structure.
+		 *
+		 * XXX: There is an implicit assumption this can only happen in
+		 * "resizable" segments, where only one shared structure is allowed.
+		 * This has to be implemented more cleanly.
 		 */
 		if (result->size != size)
 		{
-			LWLockRelease(ShmemIndexLock);
-			ereport(ERROR,
-					(errmsg("ShmemIndex entry size is wrong for data structure"
-							" \"%s\": expected %zu, actual %zu",
-							name, size, result->size)));
+			Size delta = size - result->size;
+
+			result->size = size;
+
+			/* Reflect size change in the shared segment */
+			SpinLockAcquire(Segments[shmem_segment].ShmemLock);
+			Segments[shmem_segment].ShmemSegHdr->freeoffset += delta;
+			SpinLockRelease(Segments[shmem_segment].ShmemLock);
 		}
+
 		structPtr = result->location;
 	}
 	else
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 86ffe020c01..81881ef56c1 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -63,6 +63,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -4318,6 +4319,15 @@ PostgresMain(const char *dbname, const char *username)
 	 */
 	BeginReportingGUCOptions();
 
+	/* Check the shared barrier; if it's still active, join and wait. */
+	WaitOnShmemBarrier();
+
+	/*
+	 * After waiting on the barrier above we are guaranteed to have
+	 * NSharedBuffers broadcast, so we can use it in the function below.
+	 */
+	AdjustShmemSize();
+
 	/*
 	 * Also set up handler to log session end; we have to wait till now to be
 	 * sure Log_disconnections has its final value.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..82cee6b8877 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -155,6 +155,8 @@ REPLICATION_ORIGIN_DROP	"Waiting for a replication origin to become inactive so
 REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
+SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
 WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for streaming replication."
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index cff8bb815f9..7b3ac5f3716 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1105,13 +1105,14 @@
 
 # We sometimes multiply the number of shared buffers by two without
 # checking for overflow, so we mustn't allow more than INT_MAX / 2.
-{ name => 'shared_buffers', type => 'int', context => 'PGC_POSTMASTER', group => 'RESOURCES_MEM',
+{ name => 'shared_buffers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_MEM',
   short_desc => 'Sets the number of shared memory buffers used by the server.',
   flags => 'GUC_UNIT_BLOCKS',
   variable => 'NBuffers',
   boot_val => '16384',
   min => '16',
   max => 'INT_MAX / 2',
+  assign_hook => 'assign_shared_buffers'
 },
 
 # TODO: should this be PGC_POSTMASTER?
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3f37b294af6..e2e97866b40 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,7 +318,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_skipped);
 
 /* in buf_init.c */
-extern void BufferManagerShmemInit(void);
+extern void BufferManagerShmemInit(int);
 extern Size BufferManagerShmemSize(void);
 
 /* in localbuf.c */
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 3baf418b3d1..847f56a36dc 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -64,6 +64,7 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
+extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
@@ -83,5 +84,7 @@ extern void CreateSharedMemoryAndSemaphores(void);
 extern void AttachSharedMemoryStructs(void);
 #endif
 extern void InitializeShmemGUCs(void);
+extern void CoordinateShmemResize(void);
+extern bool AnonymousShmemResize(void);
 
 #endif							/* IPC_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..cba586027a7 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, ShmemResize)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 715f6acb5dd..eba28ce8a5c 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,6 +24,7 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "storage/barrier.h"
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
@@ -69,6 +70,25 @@ typedef struct ShmemSegment
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/*
+ * ShmemControl is shared between backends and helps to coordinate shared
+ * memory resize.
+ */
+typedef struct
+{
+	pg_atomic_uint32 	NSharedBuffers;
+	Barrier 			Barrier;
+	pg_atomic_uint64 	Generation;
+	bool                Resizable;
+} ShmemControl;
+
+extern PGDLLIMPORT ShmemControl *ShmemCtrl;
+
+/* The phases for shared memory resizing, used by the ProcSignal barrier. */
+#define SHMEM_RESIZE_REQUESTED			0
+#define SHMEM_RESIZE_START				1
+#define SHMEM_RESIZE_DONE				2
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -123,6 +143,12 @@ extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 							int *memfd_flags);
 void PrepareHugePages(void);
 
+extern bool ProcessBarrierShmemResize(Barrier *barrier);
+extern void assign_shared_buffers(int newval, void *extra, bool *pending);
+extern void AdjustShmemSize(void);
+extern void WaitOnShmemBarrier(void);
+extern void ShmemControlInit(void);
+
 /*
  * To be able to dynamically resize largest parts of the data stored in shared
  * memory, we split it into multiple shared memory mappings segments. Each
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 428aa3fd68a..5ced2a83537 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -42,9 +42,10 @@ typedef enum
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
 	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
+	PMSIGNAL_SHMEM_RESIZE,	/* resize shared memory */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
+#define NUM_PMSIGNALS (PMSIGNAL_SHMEM_RESIZE+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 2733bbb8c5b..97033f84dce 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,6 +54,7 @@ typedef enum
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
+	PROCSIGNAL_BARRIER_SHMEM_RESIZE, /* ask backends to resize shared memory */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..691c39c8ad2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2770,6 +2770,7 @@ ShellTypeInfo
 ShippableCacheEntry
 ShippableCacheKey
 ShmemIndexEnt
+ShmemControl
 ShutdownForeignScan_function
 ShutdownInformation
 ShutdownMode
-- 
2.34.1

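To make the coordination flow above easier to follow, here is a condensed
sketch of the two sides of the protocol. It is not code from the patch and
only compiles inside a backend with the patch applied; do_local_remap() is a
placeholder for the real remapping work (AnonymousShmemResize()) and all
error handling is omitted.

/* Coordinator side (postmaster): announce the resize, remember the generation. */
static void
coordinator_sketch(void)
{
	BarrierInit(&ShmemCtrl->Barrier, 0);
	pg_atomic_init_u64(&ShmemCtrl->Generation,
					   EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE));
	/* The actual remap happens later, when PMSIGNAL_SHMEM_RESIZE arrives. */
}

/* Participant side (any backend): called from ProcessProcSignalBarrier(). */
static bool
participant_sketch(Barrier *barrier)
{
	BarrierAttach(barrier);

	/* Phase 0 -> 1: everyone has seen the new NBuffers value. */
	if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);	/* one elected backend */

	do_local_remap();			/* placeholder for AnonymousShmemResize() */

	/* Phase 1 -> 2: every mapping has been adjusted, the resize is done. */
	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
	BarrierDetach(barrier);
	return true;
}

The dynamic barrier is what lets late-spawned backends attach at whatever
phase is current and still end up with a consistent view, which is exactly
what WaitOnShmemBarrier() does during backend startup.
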
Attachment: 0014-Support-shrinking-shared-buffers-20251013.patch (application/x-patch)
From 63dd4ccbb0ae0de2eefb72ae5a4a7bf7b5a6455b Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 19 Jun 2025 17:38:29 +0200
Subject: [PATCH 14/19] Support shrinking shared buffers

Buffer eviction
===============
When shrinking the shared buffer pool, each buffer in the area being
shrunk needs to be flushed if it's dirty so as not to lose the changes
to that buffer after shrinking. Also, each such buffer needs to be
removed from the buffer mapping table so that backends do not access it
after shrinking.

Buffer eviction requires a separate barrier phase for two reasons:

1. No other backend should map a new page to any of the buffers being
   evicted while eviction is in progress. So they wait while eviction is
   in progress.

2. Since a pinned buffer has the pin recorded in the backend's local
   memory as well as in the buffer descriptor (which is in shared memory),
   eviction should not coincide with remapping the shared memory of a
   backend. Otherwise we might lose consistency between the local and
   shared pinning records. Hence it needs to be carried out in
   ProcessBarrierShmemResize() and not in AnonymousShmemResize(), as
   indicated by the now-removed comment.

If a buffer being evicted is pinned, we raise a FATAL error, but this should
be improved. There are multiple options: 1. wait for the pinned buffer to get
unpinned, 2. kill the backend or have it cancel the query itself, or 3.
roll back the operation. Note that options 1 and 2 would require the
pinning-related local and shared records to be accessed, but the
infrastructure to do either of those is not available right now.

Removing the evicted buffers from buffer ring
=============================================
If the buffer pool has been shrunk, the buffers in the buffer ring may
not be valid anymore. Modify GetBufferFromRing to check if the buffer is
still valid before using it. This makes GetBufferFromRing() a bit more
expensive because of the additional boolean condition, and it masks any
bug that introduces an invalid buffer into the ring. The alternative fix
is more complex, as explained below.

The strategy object is created in CurrentMemoryContext and is not
available in any global structure, and thus is not accessible when
processing buffer-resizing barriers. We may modify GetAccessStrategy()
to register the strategy in a global linked list and then arrange to
deregister it once it's no longer in use. Looking at the places which
use GetAccessStrategy(), fixing all those may be some work.

Author: Ashutosh Bapat
Reviewed-by: Tomas Vondra
---
 src/backend/port/sysv_shmem.c                 | 42 ++++++---
 src/backend/storage/buffer/bufmgr.c           | 93 +++++++++++++++++++
 src/backend/storage/buffer/freelist.c         | 18 +++-
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/storage/bufmgr.h                  |  1 +
 src/include/storage/pg_shmem.h                |  1 +
 6 files changed, 139 insertions(+), 17 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 54d335b2e5d..9e1b2c3201f 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -993,14 +993,6 @@ AnonymousShmemResize(void)
 	 */
 	pending_pm_shmem_resize = false;
 
-	/*
-	 * XXX: Currently only increasing of shared_buffers is supported. For
-	 * decreasing something similar has to be done, but buffer blocks with
-	 * data have to be drained first.
-	 */
-	if(NBuffersOld > NBuffers)
-		return false;
-
 #ifndef MAP_HUGETLB
 	/* PrepareHugePages should have dealt with this case */
 	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
@@ -1099,11 +1091,14 @@ AnonymousShmemResize(void)
 				 * all the pointers are still valid, and we only need to update
 				 * structures size in the ShmemIndex once -- any other backend
 				 * will pick up this shared structure from the index.
-				 *
-				 * XXX: This is the right place for buffer eviction as well.
 				 */
 				BufferManagerShmemInit(NBuffersOld);
 
+				/*
+				 * Wipe out the evictor PID so that it can be used for the next
+				 * buffer resizing operation.
+				*/
+				 */
 				/* If all fine, broadcast the new value */
 				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
 			}
@@ -1156,11 +1151,31 @@ ProcessBarrierShmemResize(Barrier *barrier)
 	 * XXX: If we need to be able to abort resizing, this has to be done later,
 	 * after the SHMEM_RESIZE_DONE.
 	 */
-	if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+
+	/*
+	 * Evict extra buffers when shrinking shared buffers. We need to do this
+	 * while the memory for extra buffers is still mapped i.e. before remapping
+	 * the shared memory segments to a smaller memory area.
+	 */
+	if (NBuffersOld > NBuffersPending)
 	{
-		Assert(IsUnderPostmaster);
-		SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
+		BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);
+
+		/*
+		 * TODO: If the buffer eviction fails for any reason, we should
+		 * gracefully roll back the shared buffer resizing and try again. But the
+		 * infrastructure to do so is not available right now. Hence just raise
+		 * a FATAL so that the system restarts.
+		 */
+		if (!EvictExtraBuffers(NBuffersPending, NBuffersOld))
+			elog(FATAL, "buffer eviction failed");
+
+		if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_EVICT))
+			SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
 	}
+	else
+		if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
+			SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
 
 	AnonymousShmemResize();
 
@@ -1684,5 +1699,6 @@ ShmemControlInit(void)
 
 		/* shmem_resizable should be initialized by now */
 		ShmemCtrl->Resizable = shmem_resizable;
+		ShmemCtrl->evictor_pid = 0;
 	}
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index edf17ce3ea1..467d9880f7b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/read_stream.h"
 #include "storage/smgr.h"
@@ -7457,3 +7458,95 @@ const PgAioHandleCallbacks aio_local_buffer_readv_cb = {
 	.complete_local = local_buffer_readv_complete,
 	.report = buffer_readv_report,
 };
+
+/*
+ * When shrinking the shared buffer pool, evict the buffers that will not be
+ * part of the shrunk buffer pool.
+ */
+bool
+EvictExtraBuffers(int newBufSize, int oldBufSize)
+{
+	bool result = true;
+
+	/*
+	 * If a buffer being evicted is locked, this function will need to wait.
+	 * It should not be called from the postmaster, since the postmaster
+	 * cannot wait on a lock.
+	 */
+	Assert(IsUnderPostmaster);
+
+	/*
+	 * Let only one backend perform eviction. We could split the work across all
+	 * the backends but that doesn't seem necessary.
+	 *
+	 * The first backend to acquire ShmemResizeLock sets its own PID as the
+	 * evictor PID, so that other backends know that the eviction is in progress
+	 * or has already been performed. The evictor backend releases the lock when
+	 * it finishes eviction.  While the eviction is in progress, backends other
+	 * than the evictor backend won't be able to take the lock, so they won't
+	 * perform eviction. A backend may acquire the lock after eviction has
+	 * completed, but it will not perform eviction since the evictor PID is
+	 * already set. The evictor PID is reset only when the buffer resizing
+	 * finishes. Thus only one backend will perform eviction in a given instance
+	 * of shared buffer resizing.
+	 *
+	 * Any backend which acquires this lock will release it before the eviction
+	 * phase finishes, hence the same lock can be reused for the next phase of
+	 * resizing buffers.
+	 */
+	if (LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+	{
+		if (ShmemCtrl->evictor_pid == 0)
+		{
+			ShmemCtrl->evictor_pid = MyProcPid;
+
+			/*
+			 * TODO: Before evicting any buffer, we should check whether any of the
+			 * buffers are pinned. If we find that a buffer is pinned after evicting
+			 * most of them, that will impact performance since all those evicted
+			 * buffers might need to be read again.
+			 */
+			for (Buffer buf = newBufSize + 1; buf <= oldBufSize; buf++)
+			{
+				BufferDesc *desc = GetBufferDescriptor(buf - 1);
+				uint32		buf_state;
+				bool		buffer_flushed;
+
+				buf_state = pg_atomic_read_u32(&desc->state);
+
+				/*
+				 * Nobody is expected to touch the buffers while resizing is
+				 * going on, hence an unlocked precheck should be safe and
+				 * saves some cycles.
+				 */
+				if (!(buf_state & BM_VALID))
+					continue;
+
+				/*
+				 * XXX: Looks like CurrentResourceOwner can be NULL here, find
+				 * another one in that case?
+				 */
+				if (CurrentResourceOwner)
+					ResourceOwnerEnlarge(CurrentResourceOwner);
+
+				ReservePrivateRefCountEntry();
+
+				LockBufHdr(desc);
+
+				/*
+				 * Now that we have locked buffer descriptor, make sure that the
+				 * buffer without valid data has been skipped above.
+				 */
+				Assert(buf_state & BM_VALID);
+
+				if (!EvictUnpinnedBufferInternal(desc, &buffer_flushed))
+				{
+					elog(WARNING, "could not remove buffer %u, it is pinned", buf);
+					result = false;
+					break;
+				}
+			}
+		}
+		LWLockRelease(ShmemResizeLock);
+	}
+
+	return result;
+}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 299f6aa8e7e..0da8fbb580e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -669,12 +669,22 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 		strategy->current = 0;
 
 	/*
-	 * If the slot hasn't been filled yet, tell the caller to allocate a new
-	 * buffer with the normal allocation strategy.  He will then fill this
-	 * slot by calling AddBufferToRing with the new buffer.
+	 * If the slot hasn't been filled yet or the buffer in the slot has been
+	 * invalidated when the buffer pool was shrunk, tell the caller to allocate
+	 * a new buffer with the normal allocation strategy.  He will then fill this
+	 * slot by calling AddBufferToRing with the new buffer.
+	 *
+	 * TODO: Ideally we would want to check for bufnum > NBuffers only once
+	 * after every time the buffer pool is shrunk, so as to catch any runtime
+	 * bugs that introduce invalid buffers into the ring. But that is
+	 * complicated. The BufferAccessStrategy objects are not accessible outside
+	 * the ScanState, hence we cannot purge the buffers while evicting the
+	 * buffers. After the resizing is finished, it's not possible to notice
+	 * when we touch the first and the last of those objects. See if this can
+	 * be fixed.
 	 */
 	bufnum = strategy->buffers[strategy->current];
-	if (bufnum == InvalidBuffer)
+	if (bufnum == InvalidBuffer || bufnum > NBuffers)
 		return NULL;
 
 	buf = GetBufferDescriptor(bufnum - 1);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 82cee6b8877..9a6a6275305 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -156,6 +156,7 @@ REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it c
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
 SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_EVICT	"Waiting for other backends to finish the buffer eviction phase."
 SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e2e97866b40..e7c973adca8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -316,6 +316,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_evicted,
 									int32 *buffers_flushed,
 									int32 *buffers_skipped);
+extern bool EvictExtraBuffers(int fromBuf, int toBuf);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(int);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index eba28ce8a5c..0a59746b472 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -77,6 +77,7 @@ extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
 typedef struct
 {
 	pg_atomic_uint32 	NSharedBuffers;
+	pid_t				evictor_pid;
 	Barrier 			Barrier;
 	pg_atomic_uint64 	Generation;
 	bool                Resizable;
-- 
2.34.1

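The eviction coordination in this patch boils down to a small "elect one
worker" pattern. Here is a condensed sketch, not the actual patch code:
ShmemResizeLock and evictor_pid come from the patch, while do_eviction() is a
placeholder for the loop over buffer descriptors in EvictExtraBuffers().

static void
evictor_election_sketch(void)
{
	/* Only try the lock; never block, since we are in interrupt processing. */
	if (!LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
		return;					/* somebody else is evicting right now */

	if (ShmemCtrl->evictor_pid == 0)
	{
		ShmemCtrl->evictor_pid = MyProcPid;	/* claim the work */
		do_eviction();			/* placeholder for the eviction loop */
	}

	LWLockRelease(ShmemResizeLock);

	/*
	 * evictor_pid is intentionally left set; it is cleared only after the
	 * whole resize completes, so late acquirers of the lock skip the work.
	 */
}

The PID doubles as a "work already claimed" marker, so a backend that acquires
the lock after eviction has finished simply does nothing.
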
Attachment: 0015-Reinitialize-StrategyControl-after-resizing-20251013.patch (application/x-patch)
From 34c5477568b4a31863aeba57c4c0cb700e274f4e Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Thu, 19 Jun 2025 17:38:51 +0200
Subject: [PATCH 15/19] Reinitialize StrategyControl after resizing buffers

... and BgBufferSync and ClockSweepTick adjustments

Reinitializing strategy control area
====================================
The commit introduces a separate function StrategyReInitialize() instead
of reusing StrategyInitialize(), since some of the things that the latter
does are not required in the former. Here's a list of what
StrategyReInitialize() does and how it differs from
StrategyInitialize().

1. StrategyControl pointer needn't be fetched again since it should not
   change. But added an Assert to make sure the pointer is valid.
2. &StrategyControl->buffer_strategy_lock need not be initialized again.
3. nextVictimBuffer, completePasses and numBufferAllocs are viewed in
   the context of NBuffers. Now that NBuffers itself has changed, those
   three do not make sense. Reset them as if the server had just
   restarted.

Ability to delay resizing operation
===================================
This commit introduces a flag delay_shmem_resize, which PostgreSQL
backends and workers can use to signal the coordinator to delay the
resizing operation. The background writer sets this flag while it is
scanning buffers.

Background writer operation
===========================
The background writer is blocked while the actual resizing is in progress.
It stops a scan in progress when it sees that the resizing has begun or is
about to begin. Once the buffer resizing is finished, before resuming
regular operation, bgwriter resets the information saved so far. This
information is viewed in the context of NBuffers and hence does not make
sense after a resize, which changes NBuffers.

Buffer lookup table
===================
Right now there is no way to free shared memory. Even if we shrink the
buffer lookup table when shrinking the buffer pool, the unused hash table
entries cannot be freed. When we expand the buffer pool, more entries
can be allocated, but we cannot resize the hash table directory without
rehashing all the entries. Just allocating more entries will lead to
more contention. Hence we set up the buffer lookup table only once at the
beginning, considering the maximum possible size of the buffer pool,
which is MaxAvailableMemory. The shared buffer lookup table and
StrategyControl are not resized even if the buffer pool is resized, hence
they are allocated in the main shared memory segment.

TODO:
====
1. The way BgBufferSync is written today, it packs four functionalities:
   setting up the buffer sync state, performing the buffer sync,
   resetting the buffer sync state when bgwriter_lru_maxpages <= 0 and
   setting it up again after bgwriter_lru_maxpages > 0. That makes the
   code hard to read.  It would be good to divide this function into 3-4
   different functions, each performing one of those tasks. Then pack all
   the state (the local variables from that function converted to static
   globals) into a structure, which is passed to these functions. Once
   that happens BgBufferSyncReset() will call one of the functions to
   reset the state when the buffer pool is resized.

2. The condition (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) ==
   NBuffers) checked in BgBufferSync() to detect whether buffer resizing
   is "about to begin" is wrong. NBuffers is not changed until
   AnonymousShmemResize() is called, and it won't be called until
   BgBufferSync() finishes if it has already begun. We need a better
   condition to check whether buffer resizing is about to begin.

Author: Ashutosh Bapat
Reviewed-by: Tomas Vondra
---
 src/backend/port/sysv_shmem.c          | 23 ++++++--
 src/backend/storage/buffer/buf_init.c  | 19 +++++--
 src/backend/storage/buffer/buf_table.c |  9 ++-
 src/backend/storage/buffer/bufmgr.c    | 72 ++++++++++++++++++------
 src/backend/storage/buffer/freelist.c  | 77 ++++++++++++++++++++++++--
 src/include/storage/buf_internals.h    |  1 +
 src/include/storage/bufmgr.h           |  1 +
 src/include/storage/ipc.h              |  1 +
 src/include/storage/pg_shmem.h         |  5 +-
 9 files changed, 170 insertions(+), 38 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 9e1b2c3201f..3be28e228ae 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -104,6 +104,7 @@ AnonymousMapping Mappings[ANON_MAPPINGS];
 
 /* Flag telling postmaster that resize is needed */
 volatile bool pending_pm_shmem_resize = false;
+volatile bool delay_shmem_resize = false;
 
 /* Keeps track of the previous NBuffers value */
 static int NBuffersOld = -1;
@@ -144,12 +145,11 @@ static int NBuffersPending = -1;
  * makes sense to evaluate them more precise.
  */
 static double SHMEM_RESIZE_RATIO[6] = {
-	0.1,    /* MAIN_SHMEM_SEGMENT */
+	0.15,    /* MAIN_SHMEM_SEGMENT */
 	0.6,    /* BUFFERS_SHMEM_SEGMENT */
 	0.1,    /* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
 	0.1,    /* BUFFER_IOCV_SHMEM_SEGMENT */
 	0.05,   /* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
-	0.05,   /* STRATEGY_SHMEM_SEGMENT */
 };
 
 /*
@@ -225,8 +225,6 @@ MappingName(int shmem_segment)
 			return "iocv";
 		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
 			return "checkpoint";
-		case STRATEGY_SHMEM_SEGMENT:
-			return "strategy";
 		default:
 			return "unknown";
 	}
@@ -1125,13 +1123,17 @@ ProcessBarrierShmemResize(Barrier *barrier)
 {
 	Assert(IsUnderPostmaster);
 
-	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
-		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d, %d",
+		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize, delay_shmem_resize);
 
 	/* Wait until we have seen the new NBuffers value */
 	if (!pending_pm_shmem_resize)
 		return false;
 
+	/* Wait till this process becomes ready to resize buffers. */
+	if (delay_shmem_resize)
+		return false;
+
 	/*
 	 * First thing to do after attaching to the barrier is to wait for others.
 	 * We can't simply use BarrierArriveAndWait, because backends might arrive
@@ -1182,6 +1184,15 @@ ProcessBarrierShmemResize(Barrier *barrier)
 	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
 	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
 
+	if (MyBackendType == B_BG_WRITER)
+	{
+		/*
+		 * Before resuming regular background writer activity, adjust the
+		 * statistics collected so far.
+		 */
+		BgBufferSyncReset(NBuffersOld, NBuffers);
+	}
+
 	BarrierDetach(barrier);
 	return true;
 }
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 0e72e373193..be64fa5a136 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -152,8 +152,15 @@ BufferManagerShmemInit(int FirstBufferToInit)
 	}
 #endif
 
-	/* Init other shared buffer-management stuff */
-	StrategyInitialize(!foundDescs);
+	/*
+	 * Init other shared buffer-management stuff from scratch when configuring
+	 * the buffer pool for the first time. If we are just resizing the buffer
+	 * pool, adjust only the required structures.
+	 */
+	if (FirstBufferToInit == 0)
+		StrategyInitialize(!foundDescs);
+	else
+		StrategyReInitialize(FirstBufferToInit);
 
 	/* Initialize per-backend file flush context */
 	WritebackContextInit(&BackendWritebackContext,
@@ -184,9 +191,6 @@ BufferManagerShmemSize(void)
 	size = add_size(size, mul_size(NBuffers, BLCKSZ));
 	Mappings[BUFFERS_SHMEM_SEGMENT].shmem_req_size = size;
 
-	/* size of stuff controlled by freelist.c */
-	Mappings[STRATEGY_SHMEM_SEGMENT].shmem_req_size = StrategyShmemSize();
-
 	/* size of I/O condition variables, plus alignment padding */
 	size = add_size(0, mul_size(NBuffers,
 								   sizeof(ConditionVariableMinimallyPadded)));
@@ -196,5 +200,10 @@ BufferManagerShmemSize(void)
 	/* size of checkpoint sort array in bufmgr.c */
 	Mappings[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_req_size = mul_size(NBuffers, sizeof(CkptSortItem));
 
+	/* Allocations in the main memory segment, at the end. */
+
+	/* size of stuff controlled by freelist.c */
+	size = add_size(0, StrategyShmemSize());
+
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 18a78967138..e5a97e557d9 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -65,11 +65,18 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
+	/*
+	 * The shared buffer lookup table is set up only once, with the maximum
+	 * possible number of entries given the maximum size of the buffer pool.
+	 * It is not resized after that even if the buffer pool is resized. Hence
+	 * it is allocated in the main shared memory segment and not in a
+	 * resizable shared memory segment.
+	 */
 	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
 								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION | HASH_FIXED_SIZE,
-								  STRATEGY_SHMEM_SEGMENT);
+								  MAIN_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 467d9880f7b..fdcb5556235 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3614,6 +3614,32 @@ BufferSync(int flags)
 	TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
 }
 
+/*
+ * Information saved between BgBufferSync() calls so we can determine the
+ * strategy point's advance rate and avoid scanning already-cleaned buffers. The
+ * variables are file-level statics instead of function-local statics so that
+ * BgBufferSyncReset() can adjust them when resizing shared buffers.
+ */
+static bool saved_info_valid = false;
+static int	prev_strategy_buf_id;
+static uint32 prev_strategy_passes;
+static int	next_to_clean;
+static uint32 next_passes;
+
+/* Moving averages of allocation rate and clean-buffer density */
+static float smoothed_alloc = 0;
+static float smoothed_density = 10.0;
+
+void
+BgBufferSyncReset(int NBuffersOld, int NBuffersNew)
+{
+	saved_info_valid = false;
+#ifdef BGW_DEBUG
+	elog(DEBUG2, "invalidated background writer status after resizing buffers from %d to %d",
+		 NBuffersOld, NBuffersNew);
+#endif
+}
+
 /*
  * BgBufferSync -- Write out some dirty buffers in the pool.
  *
@@ -3633,20 +3659,6 @@ BgBufferSync(WritebackContext *wb_context)
 	uint32		strategy_passes;
 	uint32		recent_alloc;
 
-	/*
-	 * Information saved between calls so we can determine the strategy
-	 * point's advance rate and avoid scanning already-cleaned buffers.
-	 */
-	static bool saved_info_valid = false;
-	static int	prev_strategy_buf_id;
-	static uint32 prev_strategy_passes;
-	static int	next_to_clean;
-	static uint32 next_passes;
-
-	/* Moving averages of allocation rate and clean-buffer density */
-	static float smoothed_alloc = 0;
-	static float smoothed_density = 10.0;
-
 	/* Potentially these could be tunables, but for now, not */
 	float		smoothing_samples = 16;
 	float		scan_whole_pool_milliseconds = 120000.0;
@@ -3669,6 +3681,22 @@ BgBufferSync(WritebackContext *wb_context)
 	long		new_strategy_delta;
 	uint32		new_recent_alloc;
 
+	/*
+	 * If the buffer pool is being shrunk, the buffer being written out may not
+	 * remain valid. If the buffer pool is being expanded, more buffers will
+	 * become available without this function writing out any. Hence wait until
+	 * buffer resizing finishes, i.e. go into hibernation mode.
+	 */
+	if (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+		return true;
+
+	/*
+	 * Resizing shared buffers while this function is performing an LRU scan on
+	 * them may lead to wrong results. Indicate that the resizing should wait for
+	 * the LRU scan to complete.
+	 */
+	delay_shmem_resize = true;
+
 	/*
 	 * Find out where the clock-sweep currently is, and how many buffer
 	 * allocations have happened since our last call.
@@ -3845,8 +3873,17 @@ BgBufferSync(WritebackContext *wb_context)
 	num_written = 0;
 	reusable_buffers = reusable_buffers_est;
 
-	/* Execute the LRU scan */
-	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
+	/*
+	 * Execute the LRU scan.
+	 *
+	 * If the buffer pool is being shrunk, the buffer being written may not
+	 * remain valid. If the buffer pool is being expanded, more buffers will
+	 * become available without this function writing any. Hence stop what we
+	 * are doing. This also unblocks other processes that are waiting for the
+	 * buffer resizing to finish.
+	 */
+	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est &&
+			pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) == NBuffers)
 	{
 		int			sync_state = SyncOneBuffer(next_to_clean, true,
 											   wb_context);
@@ -3905,6 +3942,9 @@ BgBufferSync(WritebackContext *wb_context)
 #endif
 	}
 
+	/* Let the resizing commence. */
+	delay_shmem_resize = false;
+
 	/* Return true if OK to hibernate */
 	return (bufs_to_lap == 0 && recent_alloc == 0);
 }
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 0da8fbb580e..55be5eebe0a 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -408,12 +408,21 @@ StrategyInitialize(bool init)
 	 *
 	 * Since we can't tolerate running out of lookup table entries, we must be
 	 * sure to specify an adequate table size here.  The maximum steady-state
-	 * usage is of course NBuffers entries, but BufferAlloc() tries to insert
-	 * a new entry before deleting the old.  In principle this could be
-	 * happening in each partition concurrently, so we could need as many as
-	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
+	 * usage is of course as many entries as there are buffers in the buffer
+	 * pool.  Right now there is no way to free shared memory. Even if we
+	 * shrink the buffer lookup table when shrinking the buffer pool, the
+	 * unused hash table entries cannot be freed. When we expand the buffer
+	 * pool, more entries can be allocated, but we cannot resize the hash table
+	 * directory without rehashing all the entries. Just allocating more entries
+	 * will lead to more contention. Hence we set up the buffer lookup table
+	 * considering the maximum possible size of the buffer pool, which is
+	 * MaxAvailableMemory.
+	 *
+	 * Additionally BufferAlloc() tries to insert a new entry before deleting the
+	 * old.  In principle this could be happening in each partition concurrently,
+	 * so we need NUM_BUFFER_PARTITIONS extra entries.
 	 */
-	InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
+	InitBufTable(MaxAvailableMemory + NUM_BUFFER_PARTITIONS);
 
 	/*
 	 * Get or create the shared strategy control block
@@ -421,7 +430,7 @@ StrategyInitialize(bool init)
 	StrategyControl = (BufferStrategyControl *)
 		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found, STRATEGY_SHMEM_SEGMENT);
+						&found, MAIN_SHMEM_SEGMENT);
 
 	if (!found)
 	{
@@ -446,6 +455,62 @@ StrategyInitialize(bool init)
 		Assert(!init);
 }
 
+/*
+ * StrategyReInitialize -- re-initialize the buffer cache replacement
+ *		strategy.
+ *
+ * To be called when resizing the buffer manager, and only from the coordinator.
+ * TODO: Assess the differences between this function and StrategyInitialize().
+ */
+void
+StrategyReInitialize(int FirstBufferIdToInit)
+{
+	bool		found;
+
+	/*
+	 * Resizing memory for buffer pools should not affect the address of
+	 * StrategyControl.
+	 */
+	if (StrategyControl != (BufferStrategyControl *)
+		ShmemInitStructInSegment("Buffer Strategy Status",
+						sizeof(BufferStrategyControl),
+						&found, MAIN_SHMEM_SEGMENT))
+		elog(FATAL, "something went wrong while re-initializing the buffer strategy");
+
+	Assert(found);
+
+	/*
+	 * TODO: Buffer lookup table adjustment: there are two options:
+	 *
+	 * 1. Resize the buffer lookup table to match the new number of buffers. But
+	 * this requires rehashing all the entries in the buffer lookup table with
+	 * the new table size.
+	 *
+	 * 2. Allocate the maximum size of the buffer lookup table at the beginning
+	 * and never resize it. This leaves a sparse buffer lookup table, which is
+	 * inefficient from both a memory and a time perspective. According to David
+	 * Rowley, the sparse entries in the buffer lookup table cause frequent
+	 * cacheline reloads, which affects performance. If the impact of that
+	 * inefficiency in a benchmark is significant, we will need to consider the
+	 * first option.
+	 */
+	/*
+	 * The clock sweep tick pointer might have become invalid. Reset it as if
+	 * starting a fresh server.
+	 */
+	pg_atomic_write_u32(&StrategyControl->nextVictimBuffer, 0);
+
+	/*
+	 * The old statistics are viewed in the context of the number of shared
+	 * buffers. They do not make sense now that the number of shared buffers
+	 * itself has changed.
+	 */
+	StrategyControl->completePasses = 0;
+	pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
+
+	/* No pending notification */
+	StrategyControl->bgwprocno = -1;
+}
+
 
 /* ----------------------------------------------------------------
  *				Backend-private buffer ring management
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c1206a46aba..20bea8132fd 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -447,6 +447,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
+extern void StrategyReInitialize(int FirstBufferToInit);
 
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e7c973adca8..74e226269af 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -300,6 +300,7 @@ extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(WritebackContext *wb_context);
+extern void BgBufferSyncReset(int NBuffersOld, int NBuffersNew);
 
 extern uint32 GetPinLimit(void);
 extern uint32 GetLocalPinLimit(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 847f56a36dc..6e7b0abb625 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -65,6 +65,7 @@ typedef void (*shmem_startup_hook_type) (void);
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
 extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
+extern PGDLLIMPORT volatile bool delay_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 0a59746b472..704b065f9e9 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -65,7 +65,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 6
+#define ANON_MAPPINGS 5
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -172,7 +172,4 @@ extern void ShmemControlInit(void);
 /* Checkpoint BufferIds */
 #define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
 
-/* Buffer strategy status */
-#define STRATEGY_SHMEM_SEGMENT 5
-
 #endif							/* PG_SHMEM_H */
-- 
2.34.1

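The interaction between the background writer and the resize coordinator in
this patch can be summarized with the following sketch. It is not the actual
patch code: delay_shmem_resize and NSharedBuffers are from the patch, while
scan_one_buffer() is a placeholder for SyncOneBuffer() and the surrounding
statistics logic in BgBufferSync().

static bool
bgwriter_scan_sketch(void)
{
	/* A resize has already been broadcast: skip the scan and hibernate. */
	if (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
		return true;

	/* Make ProcessBarrierShmemResize() report "not absorbed yet". */
	delay_shmem_resize = true;

	for (int buf_id = 0; buf_id < NBuffers; buf_id++)
	{
		/* Bail out early if a resize shows up mid-scan. */
		if (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
			break;

		scan_one_buffer(buf_id);	/* placeholder for SyncOneBuffer() */
	}

	/* Allow a pending resize to proceed again. */
	delay_shmem_resize = false;
	return false;
}

The flag only delays the barrier absorption; it does not block the coordinator,
which simply retries until every participant reports the barrier as processed.
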
Attachment: 0016-Tests-for-dynamic-shared_buffers-resizing-20251013.patch (application/x-patch)
From 603178a5c369456014cbd0fb6c171074f66aa163 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Wed, 3 Sep 2025 10:59:20 +0530
Subject: [PATCH 16/19] Tests for dynamic shared_buffers resizing

The commit adds two tests:

1. TAP test to stress test buffer pool resizing under concurrent load.

2. SQL test to test sanity of shared memory allocations and mappings
   after buffer pool resizing operation.

Author: Palak Chaturvedi <chaturvedipalak1911@gmail.com>
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
---
 src/test/Makefile                             |   2 +-
 src/test/README                               |   3 +
 src/test/buffermgr/Makefile                   |  27 ++
 src/test/buffermgr/README                     |  26 ++
 src/test/buffermgr/expected/buffer_resize.out | 237 ++++++++++++++++++
 src/test/buffermgr/meson.build                |  17 ++
 src/test/buffermgr/sql/buffer_resize.sql      |  73 ++++++
 src/test/buffermgr/t/001_resize_buffer.pl     | 126 ++++++++++
 src/test/meson.build                          |   1 +
 9 files changed, 511 insertions(+), 1 deletion(-)
 create mode 100644 src/test/buffermgr/Makefile
 create mode 100644 src/test/buffermgr/README
 create mode 100644 src/test/buffermgr/expected/buffer_resize.out
 create mode 100644 src/test/buffermgr/meson.build
 create mode 100644 src/test/buffermgr/sql/buffer_resize.sql
 create mode 100644 src/test/buffermgr/t/001_resize_buffer.pl

diff --git a/src/test/Makefile b/src/test/Makefile
index 511a72e6238..95f8858a818 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,7 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl postmaster regress isolation modules authentication recovery subscription
+SUBDIRS = perl postmaster regress isolation modules authentication recovery subscription buffermgr
 
 ifeq ($(with_icu),yes)
 SUBDIRS += icu
diff --git a/src/test/README b/src/test/README
index afdc7676519..77f11607ff7 100644
--- a/src/test/README
+++ b/src/test/README
@@ -15,6 +15,9 @@ examples/
   Demonstration programs for libpq that double as regression tests via
   "make check"
 
+buffermgr/
+  Tests for resizing buffer pool without restarting the server
+
 isolation/
   Tests for concurrent behavior at the SQL level
 
diff --git a/src/test/buffermgr/Makefile b/src/test/buffermgr/Makefile
new file mode 100644
index 00000000000..97c3da9e20a
--- /dev/null
+++ b/src/test/buffermgr/Makefile
@@ -0,0 +1,27 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/buffermgr
+#
+# Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/buffermgr/Makefile
+#
+#-------------------------------------------------------------------------
+
+EXTRA_INSTALL = contrib/pg_buffercache
+
+REGRESS = buffer_resize
+
+subdir = src/test/buffermgr
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean:
+	rm -rf tmp_check
diff --git a/src/test/buffermgr/README b/src/test/buffermgr/README
new file mode 100644
index 00000000000..c375ad80989
--- /dev/null
+++ b/src/test/buffermgr/README
@@ -0,0 +1,26 @@
+src/test/buffermgr/README
+
+Regression tests for buffer manager
+===================================
+
+This directory contains a test suite for resizing the buffer pool without restarting the server.
+
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
+
+See src/test/perl/README for more info about running these tests.
diff --git a/src/test/buffermgr/expected/buffer_resize.out b/src/test/buffermgr/expected/buffer_resize.out
new file mode 100644
index 00000000000..a986be9a5da
--- /dev/null
+++ b/src/test/buffermgr/expected/buffer_resize.out
@@ -0,0 +1,237 @@
+-- Test buffer pool resizing and shared memory allocation tracking
+-- This test resizes the buffer pool multiple times and monitors
+-- shared memory allocations related to buffer management
+-- Create a separate schema for this test
+CREATE SCHEMA buffer_resize_test;
+SET search_path TO buffer_resize_test, public;
+-- Create a view for buffer-related shared memory allocations
+CREATE VIEW buffer_allocations AS
+SELECT name, segment, size, allocated_size 
+FROM pg_shmem_allocations 
+WHERE name IN ('Buffer Blocks', 'Buffer Descriptors', 'Buffer IO Condition Variables', 
+               'Checkpoint BufferIds')
+ORDER BY name;
+-- Note: We exclude the 'main' segment even though it contains the shared buffer
+-- lookup table because it contains other shared structures whose total sizes
+-- may vary as the code changes.
+CREATE VIEW buffer_segments AS
+SELECT name, size, mapping_size, mapping_reserved_size
+FROM pg_shmem_segments
+WHERE name <> 'main'
+ORDER BY name;
+-- Enable pg_buffercache for buffer count verification
+CREATE EXTENSION IF NOT EXISTS pg_buffercache;
+-- Test 1: Default shared_buffers 
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 128MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 134221824 |      134221824
+ Buffer Descriptors            | descriptors |   1048576 |        1048576
+ Buffer IO Condition Variables | iocv        |    262144 |         262144
+ Checkpoint BufferIds          | checkpoint  |    327680 |         327680
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 134225920 |    134225920 |            2576982016
+ checkpoint  |    335872 |       335872 |             214753280
+ descriptors |   1056768 |      1056768 |             429498368
+ iocv        |    270336 |       270336 |             429498368
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        16384
+(1 row)
+
+-- Test 2: Set to 64MB  
+ALTER SYSTEM SET shared_buffers = '64MB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+SELECT pg_sleep(1);
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 64MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size   | allocated_size 
+-------------------------------+-------------+----------+----------------
+ Buffer Blocks                 | buffers     | 67112960 |       67112960
+ Buffer Descriptors            | descriptors |   524288 |         524288
+ Buffer IO Condition Variables | iocv        |   131072 |         131072
+ Checkpoint BufferIds          | checkpoint  |   163840 |         163840
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size   | mapping_size | mapping_reserved_size 
+-------------+----------+--------------+-----------------------
+ buffers     | 67117056 |     67117056 |            2576982016
+ checkpoint  |   172032 |       172032 |             214753280
+ descriptors |   532480 |       532480 |             429498368
+ iocv        |   139264 |       139264 |             429498368
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+         8192
+(1 row)
+
+-- Test 3: Set to 256MB
+ALTER SYSTEM SET shared_buffers = '256MB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+SELECT pg_sleep(1);
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 256MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 268439552 |      268439552
+ Buffer Descriptors            | descriptors |   2097152 |        2097152
+ Buffer IO Condition Variables | iocv        |    524288 |         524288
+ Checkpoint BufferIds          | checkpoint  |    655360 |         655360
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 268443648 |    268443648 |            2576982016
+ checkpoint  |    663552 |       663552 |             214753280
+ descriptors |   2105344 |      2105344 |             429498368
+ iocv        |    532480 |       532480 |             429498368
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        32768
+(1 row)
+
+-- Test 4: Set to 100MB (non-power-of-two)
+ALTER SYSTEM SET shared_buffers = '100MB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+SELECT pg_sleep(1);
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 100MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 104861696 |      104861696
+ Buffer Descriptors            | descriptors |    819200 |         819200
+ Buffer IO Condition Variables | iocv        |    204800 |         204800
+ Checkpoint BufferIds          | checkpoint  |    256000 |         256000
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 104865792 |    104865792 |            2576982016
+ checkpoint  |    262144 |       262144 |             214753280
+ descriptors |    827392 |       827392 |             429498368
+ iocv        |    212992 |       212992 |             429498368
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        12800
+(1 row)
+
+-- Test 5: Set to minimum 128kB
+ALTER SYSTEM SET shared_buffers = '128kB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+SELECT pg_sleep(1);
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 128kB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |  size  | allocated_size 
+-------------------------------+-------------+--------+----------------
+ Buffer Blocks                 | buffers     | 135168 |         135168
+ Buffer Descriptors            | descriptors |   1024 |           1024
+ Buffer IO Condition Variables | iocv        |    256 |            256
+ Checkpoint BufferIds          | checkpoint  |    320 |            320
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |  size  | mapping_size | mapping_reserved_size 
+-------------+--------+--------------+-----------------------
+ buffers     | 139264 |       139264 |            2576982016
+ checkpoint  |   8192 |         8192 |             214753280
+ descriptors |   8192 |         8192 |             429498368
+ iocv        |   8192 |         8192 |             429498368
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+           16
+(1 row)
+
+-- Clean up the schema and all its objects
+RESET search_path;
+DROP SCHEMA buffer_resize_test CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to view buffer_resize_test.buffer_allocations
+drop cascades to view buffer_resize_test.buffer_segments
+drop cascades to extension pg_buffercache
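
For reference, the buffer counts in the expected output above follow directly
from shared_buffers divided by the block size, e.g. 100MB / 8kB = 12800
buffers. A quick cross-check (assuming pg_buffercache is installed and no
resize is pending) could look like this, and should report the same number in
both columns:

=# SELECT pg_size_bytes(current_setting('shared_buffers'))
          / current_setting('block_size')::bigint AS expected_buffers,
          (SELECT count(*) FROM pg_buffercache) AS actual_buffers;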
diff --git a/src/test/buffermgr/meson.build b/src/test/buffermgr/meson.build
new file mode 100644
index 00000000000..e71dcdea685
--- /dev/null
+++ b/src/test/buffermgr/meson.build
@@ -0,0 +1,17 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'buffermgr',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'buffer_resize',
+    ],
+  },
+  'tap': {
+    'tests': [
+      't/001_resize_buffer.pl',
+    ],
+  },
+}
diff --git a/src/test/buffermgr/sql/buffer_resize.sql b/src/test/buffermgr/sql/buffer_resize.sql
new file mode 100644
index 00000000000..45f5bb6d78b
--- /dev/null
+++ b/src/test/buffermgr/sql/buffer_resize.sql
@@ -0,0 +1,73 @@
+-- Test buffer pool resizing and shared memory allocation tracking
+-- This test resizes the buffer pool multiple times and monitors
+-- shared memory allocations related to buffer management
+
+-- Create a separate schema for this test
+CREATE SCHEMA buffer_resize_test;
+SET search_path TO buffer_resize_test, public;
+
+-- Create a view for buffer-related shared memory allocations
+CREATE VIEW buffer_allocations AS
+SELECT name, segment, size, allocated_size 
+FROM pg_shmem_allocations 
+WHERE name IN ('Buffer Blocks', 'Buffer Descriptors', 'Buffer IO Condition Variables', 
+               'Checkpoint BufferIds')
+ORDER BY name;
+
+-- Note: We exclude the 'main' segment even though it contains the shared buffer
+-- lookup table because it contains other shared structures whose total sizes
+-- may vary as the code changes.
+CREATE VIEW buffer_segments AS
+SELECT name, size, mapping_size, mapping_reserved_size
+FROM pg_shmem_segments
+WHERE name <> 'main'
+ORDER BY name;
+
+-- Enable pg_buffercache for buffer count verification
+CREATE EXTENSION IF NOT EXISTS pg_buffercache;
+
+-- Test 1: Default shared_buffers 
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 2: Set to 64MB  
+ALTER SYSTEM SET shared_buffers = '64MB';
+SELECT pg_reload_conf();
+SELECT pg_sleep(1);
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 3: Set to 256MB
+ALTER SYSTEM SET shared_buffers = '256MB';
+SELECT pg_reload_conf();
+SELECT pg_sleep(1);
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 4: Set to 100MB (non-power-of-two)
+ALTER SYSTEM SET shared_buffers = '100MB';
+SELECT pg_reload_conf();
+SELECT pg_sleep(1);
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 5: Set to minimum 128kB
+ALTER SYSTEM SET shared_buffers = '128kB';
+SELECT pg_reload_conf();
+SELECT pg_sleep(1);
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Clean up the schema and all its objects
+RESET search_path;
+DROP SCHEMA buffer_resize_test CASCADE;
diff --git a/src/test/buffermgr/t/001_resize_buffer.pl b/src/test/buffermgr/t/001_resize_buffer.pl
new file mode 100644
index 00000000000..8cf9e4539ab
--- /dev/null
+++ b/src/test/buffermgr/t/001_resize_buffer.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2025-2025, PostgreSQL Global Development Group
+#
+# Minimal test exercising shared_buffers resizing under load
+
+use strict;
+use warnings;
+use IPC::Run;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Function to resize buffer pool and verify the change.
+sub apply_and_verify_buffer_change
+{
+	my ($node, $new_size) = @_;
+	
+	# Use a single background_psql session for consistency
+	my $psql_session = $node->background_psql('postgres');
+	$psql_session->query_safe("ALTER SYSTEM SET shared_buffers = '$new_size'");
+	$psql_session->query_safe("SELECT pg_reload_conf()");
+	
+	# Wait till the resizing finishes using the same session
+	# 
+	# TODO: Right now there is no way to know when the resize has finished and
+	# all the backends are using new value of shared_buffers. Hence we poll
+	# manually until we get the expected value in the same session.
+	my $current_size;
+	my $attempts = 0;
+	my $max_attempts = 60; # 60 seconds timeout
+	do {
+		$current_size = $psql_session->query_safe("SHOW shared_buffers");
+		$attempts++;
+		
+		# Only sleep if we didn't get the expected result and haven't timed out yet
+		if ($current_size ne $new_size && $attempts < $max_attempts) {
+			sleep(1);
+		}
+	} while ($current_size ne $new_size && $attempts < $max_attempts);
+	
+	$psql_session->quit;
+	
+	# Check if we succeeded or timed out
+	if ($current_size ne $new_size) {
+		die "Timeout waiting for shared_buffers to change to $new_size (got $current_size after ${attempts}s)";
+	}
+}
+
+# Initialize a cluster and start pgbench in the background for concurrent load.
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+$node->safe_psql('postgres', "CREATE EXTENSION pg_buffercache");
+my $pgb_scale = 10;
+my $pgb_duration = 120;
+my $pgb_num_clients = 10;
+$node->pgbench(
+	"--initialize --init-steps=dtpvg --scale=$pgb_scale --quiet",
+	0,
+	[qr{^$}],
+	[   # stderr patterns to verify initialization stages
+		qr{dropping old tables},
+		qr{creating tables},
+		qr{done in \d+\.\d\d s }
+	],
+	"pgbench initialization (scale=$pgb_scale)"
+);
+my ($pgbench_stdin, $pgbench_stdout, $pgbench_stderr) = ('', '', '');
+my $pgbench_process = IPC::Run::start(
+	[
+		'pgbench',
+		'-p', $node->port,
+		'-T', $pgb_duration,
+		'-c', $pgb_num_clients,
+		'postgres'
+	],
+	'<'  => \$pgbench_stdin,
+	'>'  => \$pgbench_stdout,
+	'2>' => \$pgbench_stderr
+);
+
+ok($pgbench_process, "pgbench started successfully");
+
+# Allow pgbench to establish connections and start generating load.
+# 
+# TODO: When creating new backends is known to work well with buffer pool
+# resizing, this wait should be removed.
+sleep(1);
+
+# Resize buffer pool to various sizes while pgbench is running in the
+# background.
+# 
+# TODO: These are pseudo-randomly picked sizes, but we can do better.
+my $tests_completed = 0;
+my @buffer_sizes = ('900MB', '500MB', '250MB', '400MB', '120MB', '600MB');
+for my $target_size (@buffer_sizes)
+{
+	# Verify workload generator is still running
+	if (!$pgbench_process->pumpable) {
+		ok(0, "pgbench is still running");
+		last;
+	}
+	
+	apply_and_verify_buffer_change($node, $target_size);
+	$tests_completed++;
+	
+	# Wait for the resized buffer pool to stabilize. If the resized buffer pool
+	# is utilized fully, it might hit any wrongly initialized areas of shared
+	# memory.
+	sleep(2);
+}
+is($tests_completed, scalar(@buffer_sizes), "All buffer sizes were tested");
+
+# Make sure that pgbench can end normally.
+$pgbench_process->signal('TERM');
+IPC::Run::finish $pgbench_process;
+ok(grep { $pgbench_process->result == $_ } (0, 15),  "pgbench exited gracefully");
+
+# Log any error output from pgbench for debugging
+diag("pgbench stderr:\n$pgbench_stderr");
+diag("pgbench stdout:\n$pgbench_stdout");
+
+# Ensure database is still functional after all the buffer changes
+$node->connect_ok("dbname=postgres", 
+	"Database remains accessible after $tests_completed buffer resize operations");
+
+done_testing();
\ No newline at end of file
diff --git a/src/test/meson.build b/src/test/meson.build
index ccc31d6a86a..2a5ba1dec39 100644
--- a/src/test/meson.build
+++ b/src/test/meson.build
@@ -4,6 +4,7 @@ subdir('regress')
 subdir('isolation')
 
 subdir('authentication')
+subdir('buffermgr')
 subdir('postmaster')
 subdir('recovery')
 subdir('subscription')
-- 
2.34.1

0017-Revert-Introduce-pending-flag-for-GUC-assig-20251013.patch (application/x-patch)
From aeadb3079216f970505605217a2bd9fabeff584a Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Fri, 26 Sep 2025 16:47:24 +0530
Subject: [PATCH 17/19] Revert "Introduce pending flag for GUC assign hooks"

This reverts commit 0a13e56dceea8cc7a2685df7ee8cea434588681b.
---
 src/backend/access/transam/xlog.c    |  2 +-
 src/backend/commands/variable.c      |  6 +--
 src/backend/libpq/pqcomm.c           |  8 ++--
 src/backend/tcop/postgres.c          |  2 +-
 src/backend/utils/misc/guc.c         | 59 +++++++++-------------------
 src/backend/utils/misc/stack_depth.c |  2 +-
 src/include/utils/guc.h              |  2 +-
 src/include/utils/guc_hooks.h        | 20 +++++-----
 8 files changed, 40 insertions(+), 61 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cc48b253bc8..eceab341255 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2197,7 +2197,7 @@ CalculateCheckpointSegments(void)
 }
 
 void
-assign_max_wal_size(int newval, void *extra, bool *pending)
+assign_max_wal_size(int newval, void *extra)
 {
 	max_wal_size_mb = newval;
 	CalculateCheckpointSegments();
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index e40dae2ddf2..608f10d9412 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1143,7 +1143,7 @@ check_cluster_name(char **newval, void **extra, GucSource source)
  * GUC assign_hook for maintenance_io_concurrency
  */
 void
-assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
+assign_maintenance_io_concurrency(int newval, void *extra)
 {
 	/*
 	 * Reconfigure recovery prefetching, because a setting it depends on
@@ -1161,12 +1161,12 @@ assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
  * they may be assigned in either order.
  */
 void
-assign_io_max_combine_limit(int newval, void *extra, bool *pending)
+assign_io_max_combine_limit(int newval, void *extra)
 {
 	io_combine_limit = Min(newval, io_combine_limit_guc);
 }
 void
-assign_io_combine_limit(int newval, void *extra, bool *pending)
+assign_io_combine_limit(int newval, void *extra)
 {
 	io_combine_limit = Min(io_max_combine_limit, newval);
 }
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 1726a7c0993..25f739a6a17 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1951,7 +1951,7 @@ pq_settcpusertimeout(int timeout, Port *port)
  * GUC assign_hook for tcp_keepalives_idle
  */
 void
-assign_tcp_keepalives_idle(int newval, void *extra, bool *pending)
+assign_tcp_keepalives_idle(int newval, void *extra)
 {
 	/*
 	 * The kernel API provides no way to test a value without setting it; and
@@ -1984,7 +1984,7 @@ show_tcp_keepalives_idle(void)
  * GUC assign_hook for tcp_keepalives_interval
  */
 void
-assign_tcp_keepalives_interval(int newval, void *extra, bool *pending)
+assign_tcp_keepalives_interval(int newval, void *extra)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivesinterval(newval, MyProcPort);
@@ -2007,7 +2007,7 @@ show_tcp_keepalives_interval(void)
  * GUC assign_hook for tcp_keepalives_count
  */
 void
-assign_tcp_keepalives_count(int newval, void *extra, bool *pending)
+assign_tcp_keepalives_count(int newval, void *extra)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_setkeepalivescount(newval, MyProcPort);
@@ -2030,7 +2030,7 @@ show_tcp_keepalives_count(void)
  * GUC assign_hook for tcp_user_timeout
  */
 void
-assign_tcp_user_timeout(int newval, void *extra, bool *pending)
+assign_tcp_user_timeout(int newval, void *extra)
 {
 	/* See comments in assign_tcp_keepalives_idle */
 	(void) pq_settcpusertimeout(newval, MyProcPort);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 81881ef56c1..ee9f308379c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3598,7 +3598,7 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 
 /* GUC assign hook for transaction_timeout */
 void
-assign_transaction_timeout(int newval, void *extra, bool *pending)
+assign_transaction_timeout(int newval, void *extra)
 {
 	if (IsTransactionState())
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c9361a0e423..8794e26ef1d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1681,7 +1681,6 @@ InitializeOneGUCOption(struct config_generic *gconf)
 				struct config_int *conf = (struct config_int *) gconf;
 				int			newval = conf->boot_val;
 				void	   *extra = NULL;
-				bool 	   pending = false;
 
 				Assert(newval >= conf->min);
 				Assert(newval <= conf->max);
@@ -1690,13 +1689,9 @@ InitializeOneGUCOption(struct config_generic *gconf)
 					elog(FATAL, "failed to initialize %s to %d",
 						 conf->gen.name, newval);
 				if (conf->assign_hook)
-					conf->assign_hook(newval, extra, &pending);
-
-				if (!pending)
-				{
-					*conf->variable = conf->reset_val = newval;
-					conf->gen.extra = conf->reset_extra = extra;
-				}
+					conf->assign_hook(newval, extra);
+				*conf->variable = conf->reset_val = newval;
+				conf->gen.extra = conf->reset_extra = extra;
 				break;
 			}
 		case PGC_REAL:
@@ -2052,18 +2047,13 @@ ResetAllOptions(void)
 			case PGC_INT:
 				{
 					struct config_int *conf = (struct config_int *) gconf;
-					bool 			  pending = false;
 
 					if (conf->assign_hook)
 						conf->assign_hook(conf->reset_val,
-										  conf->reset_extra,
-										  &pending);
-					if (!pending)
-					{
-						*conf->variable = conf->reset_val;
-						set_extra_field(&conf->gen, &conf->gen.extra,
-										conf->reset_extra);
-					}
+										  conf->reset_extra);
+					*conf->variable = conf->reset_val;
+					set_extra_field(&conf->gen, &conf->gen.extra,
+									conf->reset_extra);
 					break;
 				}
 			case PGC_REAL:
@@ -2440,21 +2430,16 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
 							struct config_int *conf = (struct config_int *) gconf;
 							int			newval = newvalue.val.intval;
 							void	   *newextra = newvalue.extra;
-							bool 	    pending = false;
 
 							if (*conf->variable != newval ||
 								conf->gen.extra != newextra)
 							{
 								if (conf->assign_hook)
-									conf->assign_hook(newval, newextra, &pending);
-
-								if (!pending)
-								{
-									*conf->variable = newval;
-									set_extra_field(&conf->gen, &conf->gen.extra,
-													newextra);
-									changed = true;
-								}
+									conf->assign_hook(newval, newextra);
+								*conf->variable = newval;
+								set_extra_field(&conf->gen, &conf->gen.extra,
+												newextra);
+								changed = true;
 							}
 							break;
 						}
@@ -3871,24 +3856,18 @@ set_config_with_handle(const char *name, config_handle *handle,
 
 				if (changeVal)
 				{
-					bool pending = false;
-
 					/* Save old value to support transaction abort */
 					if (!makeDefault)
 						push_old_value(&conf->gen, action);
 
 					if (conf->assign_hook)
-						conf->assign_hook(newval, newextra, &pending);
-
-					if (!pending)
-					{
-						*conf->variable = newval;
-						set_extra_field(&conf->gen, &conf->gen.extra,
-										newextra);
-						set_guc_source(&conf->gen, source);
-						conf->gen.scontext = context;
-						conf->gen.srole = srole;
-					}
+						conf->assign_hook(newval, newextra);
+					*conf->variable = newval;
+					set_extra_field(&conf->gen, &conf->gen.extra,
+									newextra);
+					set_guc_source(&conf->gen, source);
+					conf->gen.scontext = context;
+					conf->gen.srole = srole;
 				}
 				if (makeDefault)
 				{
diff --git a/src/backend/utils/misc/stack_depth.c b/src/backend/utils/misc/stack_depth.c
index ef59ae62008..8f7cf531fbc 100644
--- a/src/backend/utils/misc/stack_depth.c
+++ b/src/backend/utils/misc/stack_depth.c
@@ -156,7 +156,7 @@ check_max_stack_depth(int *newval, void **extra, GucSource source)
 
 /* GUC assign hook for max_stack_depth */
 void
-assign_max_stack_depth(int newval, void *extra, bool *pending)
+assign_max_stack_depth(int newval, void *extra)
 {
 	ssize_t		newval_bytes = newval * (ssize_t) 1024;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index c3056cd2da8..f21ec37da89 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -187,7 +187,7 @@ typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource sourc
 typedef bool (*GucEnumCheckHook) (int *newval, void **extra, GucSource source);
 
 typedef void (*GucBoolAssignHook) (bool newval, void *extra);
-typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);
+typedef void (*GucIntAssignHook) (int newval, void *extra);
 typedef void (*GucRealAssignHook) (double newval, void *extra);
 typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 typedef void (*GucEnumAssignHook) (int newval, void *extra);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 658c799419e..82ac8646a8d 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -81,12 +81,12 @@ extern bool check_log_stats(bool *newval, void **extra, GucSource source);
 extern bool check_log_timezone(char **newval, void **extra, GucSource source);
 extern void assign_log_timezone(const char *newval, void *extra);
 extern const char *show_log_timezone(void);
-extern void assign_maintenance_io_concurrency(int newval, void *extra, bool *pending);
-extern void assign_io_max_combine_limit(int newval, void *extra, bool *pending);
-extern void assign_io_combine_limit(int newval, void *extra, bool *pending);
-extern void assign_max_wal_size(int newval, void *extra, bool *pending);
+extern void assign_maintenance_io_concurrency(int newval, void *extra);
+extern void assign_io_max_combine_limit(int newval, void *extra);
+extern void assign_io_combine_limit(int newval, void *extra);
+extern void assign_max_wal_size(int newval, void *extra);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
-extern void assign_max_stack_depth(int newval, void *extra, bool *pending);
+extern void assign_max_stack_depth(int newval, void *extra);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
@@ -141,13 +141,13 @@ extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 extern void assign_syslog_facility(int newval, void *extra);
 extern void assign_syslog_ident(const char *newval, void *extra);
-extern void assign_tcp_keepalives_count(int newval, void *extra, bool *pending);
+extern void assign_tcp_keepalives_count(int newval, void *extra);
 extern const char *show_tcp_keepalives_count(void);
-extern void assign_tcp_keepalives_idle(int newval, void *extra, bool *pending);
+extern void assign_tcp_keepalives_idle(int newval, void *extra);
 extern const char *show_tcp_keepalives_idle(void);
-extern void assign_tcp_keepalives_interval(int newval, void *extra, bool *pending);
+extern void assign_tcp_keepalives_interval(int newval, void *extra);
 extern const char *show_tcp_keepalives_interval(void);
-extern void assign_tcp_user_timeout(int newval, void *extra, bool *pending);
+extern void assign_tcp_user_timeout(int newval, void *extra);
 extern const char *show_tcp_user_timeout(void);
 extern bool check_temp_buffers(int *newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra,
@@ -163,7 +163,7 @@ extern bool check_transaction_buffers(int *newval, void **extra, GucSource sourc
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
 extern bool check_transaction_read_only(bool *newval, void **extra, GucSource source);
-extern void assign_transaction_timeout(int newval, void *extra, bool *pending);
+extern void assign_transaction_timeout(int newval, void *extra);
 extern const char *show_unix_socket_permissions(void);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern bool check_wal_consistency_checking(char **newval, void **extra,
-- 
2.34.1

0018-Re-implement-UI-and-synchronization-for-res-20251013.patch (application/x-patch)
From 9f804f7a003d00771304af6f0f4f96a9839571f3 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Fri, 26 Sep 2025 19:12:45 +0530
Subject: [PATCH 18/19] Re-implement UI and synchronization for resizing buffer
 pool

shared_buffers is now PGC_SIGHUP instead of PGC_POSTMASTER.  The value
of this GUC is saved in NBuffersPending instead of NBuffers, which now
reflects the buffer pool size currently in effect. When the server
starts, the shared memory size is estimated and the memory is allocated
using NBuffersPending, followed by setting NBuffers = NBuffersPending.

When the server is running, a new value of the GUC (set using ALTER
SYSTEM SET shared_buffers = ...; followed by SELECT pg_reload_conf())
does not come into effect immediately. Instead the function
pg_resize_shared_buffers() is used to resize the buffer pool. The
function uses the current value of the GUC in the backend where it is
executed, and also coordinates the resizing synchronization across
backends.

SHOW shared_buffers now shows the current size of shared buffers, along
with the pending size, if any.

A new GUC max_shared_buffers is introduced to control the maximum value
of shared_buffers that can be set. By default it is 0, in which case it
is set to the value of shared_buffers. When set explicitly, it needs to
be higher than shared_buffers. This GUC determines the size of the
address space reserved for future buffer pool sizes and the size of the
buffer lookup table.

TODO: In case the backend executing pg_resize_shared_buffers() exits
before the operation finishes, we will need somebody to clean up or
complete the half-finished resizing operation. The best possibility is
to use a background worker (most likely the background writer) to do
that. But then I think making that background worker the coordinator
itself might be a better option, since it will be restarted by the
postmaster upon premature exit.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
---
 doc/src/sgml/config.sgml                      |  44 +-
 doc/src/sgml/func/func-admin.sgml             |  57 ++
 src/backend/access/transam/slru.c             |   2 +-
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/bootstrap/bootstrap.c             |   2 +
 src/backend/port/sysv_shmem.c                 | 433 ++--------------
 src/backend/postmaster/postmaster.c           |  40 +-
 src/backend/storage/buffer/buf_init.c         | 264 ++++++++--
 src/backend/storage/buffer/bufmgr.c           | 126 +++--
 src/backend/storage/buffer/freelist.c         | 118 ++---
 src/backend/storage/ipc/ipci.c                |   8 +-
 src/backend/storage/ipc/procsignal.c          |  14 +-
 src/backend/storage/ipc/shmem.c               | 485 +++++++++++++++++-
 src/backend/tcop/postgres.c                   |  13 +-
 .../utils/activity/wait_event_names.txt       |   4 +-
 src/backend/utils/init/globals.c              |   6 +-
 src/backend/utils/init/postinit.c             |  32 ++
 src/backend/utils/misc/guc.c                  |   2 +-
 src/backend/utils/misc/guc_parameters.dat     |  16 +-
 src/include/catalog/pg_proc.dat               |   6 +
 src/include/miscadmin.h                       |   6 +-
 src/include/storage/buf_internals.h           |   2 +-
 src/include/storage/bufmgr.h                  |  17 +-
 src/include/storage/ipc.h                     |   1 -
 src/include/storage/pg_shmem.h                |  24 +-
 src/include/storage/procsignal.h              |   5 +-
 src/include/storage/shmem.h                   |   8 +
 src/include/utils/guc.h                       |   2 +
 src/test/buffermgr/Makefile                   |   3 +
 src/test/buffermgr/buffermgr_test.conf        |   9 +
 src/test/buffermgr/expected/buffer_resize.out | 184 +++++--
 src/test/buffermgr/meson.build                |   5 +
 src/test/buffermgr/sql/buffer_resize.sql      |  44 +-
 src/test/buffermgr/t/001_resize_buffer.pl     |  44 +-
 .../buffermgr/t/003_parallel_resize_buffer.pl |  71 +++
 35 files changed, 1387 insertions(+), 712 deletions(-)
 create mode 100644 src/test/buffermgr/buffermgr_test.conf
 create mode 100644 src/test/buffermgr/t/003_parallel_resize_buffer.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 39e658b7808..732f9636857 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1724,7 +1724,6 @@ include_dir 'conf.d'
         that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
         (Non-default values of <symbol>BLCKSZ</symbol> change the minimum
         value.)
-        This parameter can only be set at server start.
        </para>
 
        <para>
@@ -1747,6 +1746,49 @@ include_dir 'conf.d'
         appropriate, so as to leave adequate space for the operating system.
        </para>
 
+       <para>
+        The shared memory consumed by the buffer pool is allocated and
+        initialized according to the value of this parameter at server start.
+        A new value can be loaded while the server is running using
+        <systemitem>SIGHUP</systemitem>, but the buffer pool is not resized
+        immediately. Use <function>pg_resize_shared_buffers()</function> to
+        dynamically resize the shared buffer pool (see
+        <xref linkend="functions-admin"/> for details).
+        <command>SHOW shared_buffers</command> shows the current number of
+        shared buffers and the pending number, if any. Note that when this
+        parameter is changed, other parameters that derive their defaults from
+        its value are not changed; they may still require a server restart to
+        take the new value into account.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-max-shared-buffers" xreflabel="max_shared_buffers">
+      <term><varname>max_shared_buffers</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_shared_buffers</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the upper limit for the <varname>shared_buffers</varname> value.
+        The default value is <literal>0</literal>,
+        which means no explicit limit is set and <varname>max_shared_buffers</varname>
+        will be automatically set to the value of <varname>shared_buffers</varname>
+        at server startup.
+        If this value is specified without units, it is taken as blocks,
+        that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        This parameter determines the amount of memory address space to reserve
+        in each backend for expanding the buffer pool in the future. While the
+        memory for the buffer pool is allocated on demand as it is resized, the
+        memory required to hold the buffer manager metadata is allocated
+        statically at server start, sized for the largest buffer pool
+        size allowed by this parameter.
+       </para>
       </listitem>
      </varlistentry>
 
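
To make the reservation scheme described above concrete: the pg_shmem_segments
view used by the regression test earlier in this series exposes both the
currently mapped size and the reserved ceiling for each segment. A sketch of
how one might inspect it on a patched server (column names taken from the
buffer_resize test):

=# SELECT name, mapping_size, mapping_reserved_size
   FROM pg_shmem_segments
   WHERE name <> 'main'
   ORDER BY name;

Here mapping_reserved_size reflects the address space set aside for future
growth, while mapping_size tracks the currently allocated size as the buffer
pool is resized.
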
diff --git a/doc/src/sgml/func/func-admin.sgml b/doc/src/sgml/func/func-admin.sgml
index 1b465bc8ba7..0dc89b07c76 100644
--- a/doc/src/sgml/func/func-admin.sgml
+++ b/doc/src/sgml/func/func-admin.sgml
@@ -99,6 +99,63 @@
         <returnvalue>off</returnvalue>
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_resize_shared_buffers</primary>
+        </indexterm>
+        <function>pg_resize_shared_buffers</function> ()
+        <returnvalue>boolean</returnvalue>
+       </para>
+       <para>
+        Dynamically resizes the shared buffer pool to match the current
+        value of the <varname>shared_buffers</varname> parameter. This
+        function implements a coordinated resize process that ensures all
+        backend processes acknowledge the change before completing the
+        operation. The resize happens in multiple phases to maintain
+        data consistency and system stability. Returns <literal>true</literal>
+        if the resize was successful, or raises an error if the operation
+        fails. This function can only be called by superusers.
+       </para>
+       <para>
+        To resize shared buffers, first update the <varname>shared_buffers</varname>
+        setting and reload the configuration, then verify the new value is loaded
+        before calling this function. For example:
+<programlisting>
+postgres=# ALTER SYSTEM SET shared_buffers = '256MB';
+ALTER SYSTEM
+postgres=# SELECT pg_reload_conf();
+ pg_reload_conf
+----------------
+ t
+(1 row)
+
+postgres=# SHOW shared_buffers;
+     shared_buffers      
+-------------------------
+ 128MB (pending: 256MB)
+(1 row)
+
+postgres=# SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers
+--------------------------
+ t
+(1 row)
+
+postgres=# SHOW shared_buffers;
+ shared_buffers
+----------------
+ 256MB
+(1 row)
+</programlisting>
+        The <command>SHOW shared_buffers</command> step is important to verify
+        that the configuration reload was successful and the new value is
+        available to the current session before attempting the resize. The
+        output shows both the current and pending values when a change is waiting
+        to be applied.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5d3fcd62c94..3eae1d0c7e9 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -232,7 +232,7 @@ SimpleLruAutotuneBuffers(int divisor, int max)
 {
 	return Min(max - (max % SLRU_BANK_SIZE),
 			   Max(SLRU_BANK_SIZE,
-				   NBuffers / divisor - (NBuffers / divisor) % SLRU_BANK_SIZE));
+				   NBuffersPending / divisor - (NBuffersPending / divisor) % SLRU_BANK_SIZE));
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..ea01befe15c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4662,7 +4662,7 @@ XLOGChooseNumBuffers(void)
 {
 	int			xbuffers;
 
-	xbuffers = NBuffers / 32;
+	xbuffers = NBuffersPending / 32;
 	if (xbuffers > (wal_segment_size / XLOG_BLCKSZ))
 		xbuffers = (wal_segment_size / XLOG_BLCKSZ);
 	if (xbuffers < 8)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index fc8638c1b61..226944e4588 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -335,6 +335,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 
 	InitializeFastPathLocks();
 
+	InitializeMaxNBuffers();
+
 	CreateSharedMemoryAndSemaphores();
 
 	/*
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 3be28e228ae..380ecbc9751 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -102,14 +102,8 @@ void	   *UsedShmemSegAddr = NULL;
 
 AnonymousMapping Mappings[ANON_MAPPINGS];
 
-/* Flag telling postmaster that resize is needed */
-volatile bool pending_pm_shmem_resize = false;
 volatile bool delay_shmem_resize = false;
 
-/* Keeps track of the previous NBuffers value */
-static int NBuffersOld = -1;
-static int NBuffersPending = -1;
-
 /*
  * Anonymous mapping layout we use looks like this:
  *
@@ -137,20 +131,9 @@ static int NBuffersPending = -1;
  * reservation, into which shared memory segment can be extended and is
  * represented by the second /memfd:main with no permissions.
  *
- * The reserved space for each segment is calculated as a fraction of the total
- * reserved space (MaxAvailableMemory), as specified in the SHMEM_RESIZE_RATIO
- * array. E.g. we allow BUFFERS_SHMEM_SEGMENT to take up to 60% of the whole
- * space when resizing, based on the fact that it most likely will be the main
- * consumer of this memory. Those numbers are pulled out of thin air for now,
- * makes sense to evaluate them more precise.
+ * The reserved space for buffer manager related segments is calculated based on
+ * MaxNBuffers.
  */
-static double SHMEM_RESIZE_RATIO[6] = {
-	0.15,    /* MAIN_SHMEM_SEGMENT */
-	0.6,    /* BUFFERS_SHMEM_SEGMENT */
-	0.1,    /* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
-	0.1,    /* BUFFER_IOCV_SHMEM_SEGMENT */
-	0.05,   /* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
-};
 
 /*
  * Flag telling that we have decided to use huge pages.
@@ -160,13 +143,6 @@ static double SHMEM_RESIZE_RATIO[6] = {
  */
 static bool huge_pages_on = false;
 
-/*
- * Flag telling that we have prepared the memory layout to be resizable. If
- * false after all shared memory segments creation, it means we failed to setup
- * needed layout and falled back to the regular non-resizable approach.
- */
-static bool shmem_resizable = false;
-
 /*
  * Currently broadcasted value of NBuffers in shared memory.
  *
@@ -791,8 +767,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		if (mapping->shmem_reserved < mapping->shmem_size)
 			ereport(ERROR,
 					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
-					 errmsg("not enough shared memory is reserved"),
-					 errhint("You may need to increase \"max_available_memory\".")));
+					 errmsg("not enough shared memory is reserved")));
 
 		mmap_flags = PG_MMAP_FLAGS | mmap_flags;
 	}
@@ -961,36 +936,27 @@ AnonymousShmemDetach(int status, Datum arg)
 }
 
 /*
- * Resize all shared memory segments based on the current NBuffers value, which
- * is is applied from NBuffersPending. The actual segment resizing is done via
- * ftruncate, which will fail if is not sufficient space to expand the anon
- * file. When finished, based on the new and old values initialize new buffer
- * blocks if any.
- *
- * If reinitializing took place, as the last step this function does buffers
- * reinitialization as well and broadcasts the new value of NSharedBuffers. All
- * of that needs to be done only by one backend, the first one that managed to
- * grab the ShmemResizeLock.
+ * Resize all shared memory segments based on the new shared_buffers value
+ * (saved in the ShmemCtrl area). The actual segment resizing is done via
+ * ftruncate, which will fail if there is not sufficient space to expand
+ * the anon file.
+ *
+ * TODO: Rename this to BufferShmemResize() or something similar. Only the
+ * buffer manager's memory should be resized in this function.
  */
 bool
 AnonymousShmemResize(void)
 {
-	int		numSemas;
-	bool 	reinit = false;
 	int		mmap_flags = PG_MMAP_FLAGS;
 	Size 	hugepagesize;
 
-	NBuffers = NBuffersPending;
-
-	elog(DEBUG1, "Resize shmem from %d to %d", NBuffersOld, NBuffers);
-
-	/*
-	 * XXX: Where to reset the flag is still an open question. E.g. do we
-	 * consider a no-op when NBuffers is equal to NBuffersOld a genuine resize
-	 * and reset the flag?
-	 */
-	pending_pm_shmem_resize = false;
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
 
+	/* TODO: This is a hack. NBuffersPending should never be written by anything
+	 * other than the GUC system. Find a way to pass the new NBuffers value to
+	 * BufferManagerShmemSize(). */
+	NBuffersPending = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+	elog(DEBUG1, "Resize shmem from %d to %d", NBuffers, NBuffersPending);
+	
 #ifndef MAP_HUGETLB
 	/* PrepareHugePages should have dealt with this case */
 	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
@@ -1005,8 +971,8 @@ AnonymousShmemResize(void)
 	}
 #endif
 
-	/* Note that CalculateShmemSize indirectly depends on NBuffers */
-	CalculateShmemSize(&numSemas);
+	/* Note that BufferManagerShmemSize() indirectly depends on NBuffersPending. */
+	BufferManagerShmemSize(false);
 
 	for(int i = 0; i < ANON_MAPPINGS; i++)
 	{
@@ -1014,10 +980,18 @@ AnonymousShmemResize(void)
 		ShmemSegment *segment = &Segments[i];
 		PGShmemHeader *shmem_hdr = segment->ShmemSegHdr;
 
+		/* Main shared memory segment is always static. Ignore it. */
+		if (i == MAIN_SHMEM_SEGMENT)
+			continue;
+
+		m->shmem_req_size = add_size(m->shmem_req_size, 8192 - (m->shmem_req_size % 8192));
 #ifdef MAP_HUGETLB
 		if (huge_pages_on && (m->shmem_req_size % hugepagesize != 0))
 			m->shmem_req_size += hugepagesize - (m->shmem_req_size % hugepagesize);
 #endif
+		elog(DEBUG1, "segment[%s]: requested size %zu, current size %zu, reserved %zu",
+			 MappingName(m->shmem_segment), m->shmem_req_size, m->shmem_size,
+			 m->shmem_reserved);
 
 		if (m->shmem == NULL)
 			continue;
@@ -1025,26 +999,28 @@ AnonymousShmemResize(void)
 		if (m->shmem_size == m->shmem_req_size)
 			continue;
 
+		/* We should have reserved enough address space and made sure that the
+		 * new size fits in the existing mapping. PANIC if that's not the
+		 * case. */
 		if (m->shmem_reserved < m->shmem_req_size)
-			ereport(ERROR,
+			ereport(PANIC,
 					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
-					 errmsg("not enough shared memory is reserved"),
-					 errhint("You may need to increase \"max_available_memory\".")));
+					 errmsg("not enough shared memory is reserved")));
 
 		elog(DEBUG1, "segment[%s]: resize from %zu to %zu at address %p",
 					 MappingName(m->shmem_segment), m->shmem_size,
 					 m->shmem_req_size, m->shmem);
 
-		/* Resize the backing anon file. */
+		/* Resize the backing anon file. If the operation fails in one backend
+		 * and we do not know the status in other backends, the buffer manager
+		 * structures would become inconsistent across backends, so PANIC. */
 		if(ftruncate(m->segment_fd, m->shmem_req_size) == -1)
-			ereport(FATAL,
+			ereport(PANIC,
 					(errcode(ERRCODE_SYSTEM_ERROR),
 					 errmsg("could not truncase anonymous file for \"%s\": %m",
 							MappingName(m->shmem_segment))));
 
 		/* Adjust memory accessibility */
 		if(mprotect(m->shmem, m->shmem_req_size, PROT_READ | PROT_WRITE) == -1)
-			ereport(FATAL,
+			ereport(PANIC,
 					(errcode(ERRCODE_SYSTEM_ERROR),
 					 errmsg("could not mprotect anonymous shared memory for \"%s\": %m",
 							MappingName(m->shmem_segment))));
@@ -1052,308 +1028,19 @@ AnonymousShmemResize(void)
 		/* If shrinking, make reserved space unavailable again */
 		if(m->shmem_req_size < m->shmem_size &&
 		   mprotect(m->shmem + m->shmem_req_size, m->shmem_size - m->shmem_req_size, PROT_NONE) == -1)
-			ereport(FATAL,
+			ereport(PANIC,
 					(errcode(ERRCODE_SYSTEM_ERROR),
 					 errmsg("could not mprotect reserved shared memory for \"%s\": %m",
 							MappingName(m->shmem_segment))));
 
-		reinit = true;
 		m->shmem_size = m->shmem_req_size;
 		shmem_hdr->totalsize = m->shmem_size;
 		segment->ShmemEnd = m->shmem + m->shmem_size;
 	}
 
-	if (reinit)
-	{
-		if(IsUnderPostmaster &&
-			LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
-		{
-			/*
-			 * If the new NBuffers was already broadcasted, the buffer pool was
-			 * already initialized before.
-			 *
-			 * Since we're not on a hot path, we use lwlocks and do not need to
-			 * involve memory barrier.
-			 */
-			if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
-			{
-				/*
-				 * Allow the first backend that managed to get the lock to
-				 * reinitialize the new portion of buffer pool. Every other
-				 * process will wait on the shared barrier for that to finish,
-				 * since it's a part of the SHMEM_RESIZE_DONE phase.
-				 *
-				 * Note that it's enough when only one backend will do that,
-				 * even the ShmemInitStruct part. The reason is that resized
-				 * shared memory will maintain the same addresses, meaning that
-				 * all the pointers are still valid, and we only need to update
-				 * structures size in the ShmemIndex once -- any other backend
-				 * will pick up this shared structure from the index.
-				 */
-				BufferManagerShmemInit(NBuffersOld);
-
-				/*
-				 * Wipe out the evictor PID so that it can be used for the next
-				 * buffer resizing operation.
-				*/
-				ShmemCtrl->evictor_pid = 0;
-				/* If all fine, broadcast the new value */
-				pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
-			}
-
-			LWLockRelease(ShmemResizeLock);
-		}
-	}
-
-	return true;
-}
-
-/*
- * We are asked to resize shared memory. Wait for all ProcSignal participants
- * to join the barrier, then do the resize and wait on the barrier until all
- * participating finish resizing as well -- otherwise we face danger of
- * inconsistency between backends.
- *
- * XXX: If a backend is blocked on ReadCommand in PostgresMain, it will not
- * proceed with AnonymousShmemResize after receiving SIGHUP, until something
- * will be sent.
- */
-bool
-ProcessBarrierShmemResize(Barrier *barrier)
-{
-	Assert(IsUnderPostmaster);
-
-	elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d, %d",
-		 NBuffersOld, NBuffersPending, pending_pm_shmem_resize, delay_shmem_resize);
-
-	/* Wait until we have seen the new NBuffers value */
-	if (!pending_pm_shmem_resize)
-		return false;
-
-	/* Wait till this process becomes ready to resize buffers. */
-	if (delay_shmem_resize)
-		return false;
-
-	/*
-	 * First thing to do after attaching to the barrier is to wait for others.
-	 * We can't simply use BarrierArriveAndWait, because backends might arrive
-	 * here in disjoint groups, e.g. first two backends, pause, then second two
-	 * backends. If the resize is quick enough that can lead to a situation
-	 * when the first group is already finished before the second has appeared,
-	 * and the barrier will only synchonize withing those groups.
-	 */
-	if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
-		WaitForProcSignalBarrierReceived(
-				pg_atomic_read_u64(&ShmemCtrl->Generation));
-
-	/*
-	 * Now start the procedure, and elect one backend to ping postmaster to do
-	 * the same.
-	 *
-	 * XXX: If we need to be able to abort resizing, this has to be done later,
-	 * after the SHMEM_RESIZE_DONE.
-	 */
-
-	/*
-	 * Evict extra buffers when shrinking shared buffers. We need to do this
-	 * while the memory for extra buffers is still mapped i.e. before remapping
-	 * the shared memory segments to a smaller memory area.
-	 */
-	if (NBuffersOld > NBuffersPending)
-	{
-		BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);
-
-		/*
-		 * TODO: If the buffer eviction fails for any reason, we should
-		 * gracefully rollback the shared buffer resizing and try again. But the
-		 * infrastructure to do so is not available right now. Hence just raise
-		 * a FATAL so that the system restarts.
-		 */
-		if (!EvictExtraBuffers(NBuffersPending, NBuffersOld))
-			elog(FATAL, "buffer eviction failed");
-
-		if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_EVICT))
-			SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
-	}
-	else
-		if (BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START))
-			SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
-
-	AnonymousShmemResize();
-
-	/* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
-	BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
-
-	if (MyBackendType == B_BG_WRITER)
-	{
-		/*
-		 * Before resuming regular background writer activity, adjust the
-		 * statistics collected so far.
-		 */
-		BgBufferSyncReset(NBuffersOld, NBuffers);
-	}
-
-	BarrierDetach(barrier);
 	return true;
 }
 
-/*
- * GUC assign hook for shared_buffers.
- *
- * When setting the GUC first time after starting the server, the GUC value is
- * changed immediately since there is not shared memory setup yet.
- *
- * After the shared memory is setup, changing the GUC value requires resizing and
- * reiniatializing (at least parts of) the shared memory structures related to
- * shared buffers. That's a long and complicated process.  It's recommended for
- * an assign hook to be as minimal as possible, thus we just request shared
- * memory resize and remember the previous value.
- */
-void
-assign_shared_buffers(int newval, void *extra, bool *pending)
-{
-	/*
-	 * TODO: If a backend joins while the buffer resizing is in progress or it
-	 * reads a value of shared_buffers from configuration which is different from
-	 * the value being used by existing backends, this method may not work. Need
-	 * to think of a better solution. 
-	 */
-	if (BufferBlocks)
-	{
-		elog(DEBUG1, "bufferpool is already initialized with size = %d, reinitializing it with size = %d",
-			NBuffers, newval);
-		pending_pm_shmem_resize = true;
-		*pending = true;
-		NBuffersPending = newval;
-		NBuffersOld = NBuffers;
-	}
-	else
-	{
-		elog(DEBUG1, "initializing buffer pool with size = %d", newval);
-		NBuffers = newval;
-		*pending = false;
-		pending_pm_shmem_resize = false;
-	}
-}
-
-/*
- * Test if we have somehow missed a shmem resize signal and NBuffers value
- * differs from NSharedBuffers. If yes, catchup and do resize.
- */
-void
-AdjustShmemSize(void)
-{
-	uint32 NSharedBuffers = pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);
-
-	if (NSharedBuffers != NBuffers)
-	{
-		/*
-		 * If the broadcasted shared_buffers is different from the one we see,
-		 * it could be that the backend has missed a resize signal. To avoid
-		 * any inconsistency, adjust the shared mappings, before having a
-		 * chance to access the buffer pool.
-		 */
-		ereport(LOG,
-				(errmsg("shared_buffers has been changed from %d to %d, "
-						"resize shared memory",
-						NBuffers, NSharedBuffers)));
-		NBuffers = NSharedBuffers;
-		AnonymousShmemResize();
-	}
-}
-
-/*
- * Start resizing procedure, making sure all existing processes will have
- * consistent view of shared memory size. Must be called only in postmaster.
- */
-void
-CoordinateShmemResize(void)
-{
-	elog(DEBUG1, "Coordinating shmem resize from %d to %d",
-		 NBuffersOld, NBuffers);
-	Assert(!IsUnderPostmaster);
-
-	/*
-	 * We use dynamic barrier to help dealing with backends that were spawned
-	 * during the resize.
-	 */
-	BarrierInit(&ShmemCtrl->Barrier, 0);
-
-	/*
-	 * If the value did not change, or shared memory segments are not
-	 * initialized yet, skip the resize.
-	 */
-	if (NBuffersPending == NBuffersOld)
-	{
-		elog(DEBUG1, "Skip resizing, new %d, old %d",
-			 NBuffers, NBuffersOld);
-		return;
-	}
-
-	/*
-	 * Shared memory resize requires some coordination done by postmaster,
-	 * and consists of three phases:
-	 *
-	 * - Before the resize all existing backends have the same old NBuffers.
-	 * - When resize is in progress, backends are expected to have a
-	 *   mixture of old a new values. They're not allowed to touch buffer
-	 *   pool during this time frame.
-	 * - After resize has been finished, all existing backends, that can access
-	 *   the buffer pool, are expected to have the same new value of NBuffers.
-	 *
-	 * Those phases are ensured by joining the shared barrier associated with
-	 * the procedure. Since resizing takes time, we need to take into account
-	 * that during that time:
-	 *
-	 * - New backends can be spawned. They will check status of the barrier
-	 *   early during the bootstrap, and wait until everything is over to work
-	 *   with the new NBuffers value.
-	 *
-	 * - Old backends can exit before attempting to resize. Synchronization
-	 *   used between backends relies on ProcSignalBarrier and waits for all
-	 *   participants received the message at the beginning to gather all
-	 *   existing backends.
-	 *
-	 * - Some backends might be blocked and not responsing either before or
-	 *   after receiving the message. In the first case such backend still
-	 *   have ProcSignalSlot and should be waited for, in the second case
-	 *   shared barrier will make sure we still waiting for those backends. In
-	 *   any case there is an unbounded wait.
-	 *
-	 * - Backends might join barrier in disjoint groups with some time in
-	 *   between. That means that relying only on the shared dynamic barrier is
-	 *   not enough -- it will only synchronize resize procedure withing those
-	 *   groups. That's why we wait first for all participants of ProcSignal
-	 *   mechanism who received the message.
-	 */
-	elog(DEBUG1, "Emit a barrier for shmem resizing");
-	pg_atomic_init_u64(&ShmemCtrl->Generation,
-					   EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE));
-
-	/* To order everything after setting Generation value */
-	pg_memory_barrier();
-
-	/*
-	 * After that postmaster waits for PMSIGNAL_SHMEM_RESIZE as a sign that all
-	 * the rest of the pack has started the procedure and it can resize shared
-	 * memory as well.
-	 *
-	 * Normally we would call WaitForProcSignalBarrier here to wait until every
-	 * backend has reported on the ProcSignalBarrier. But for shared memory
-	 * resize we don't need this, as every participating backend will
-	 * synchronize on the ProcSignal barrier. In fact even if we would like to
-	 * wait here, it wouldn't be possible -- we're in the postmaster, without
-	 * any waiting infrastructure available.
-	 *
-	 * If at some point it will turn out that waiting is essential, we would
-	 * need to consider some alternatives. E.g. it could be a designated
-	 * coordination process, which is not a postmaster. Another option would be
-	 * to introduce a CoordinateShmemResize lock and allow only one process to
-	 * take it (this probably would have to be something different than
-	 * LWLocks, since they block interrupts, and coordination relies on them).
-	 */
-}
-
 /*
  * PGSharedMemoryCreate
  *
@@ -1374,7 +1061,7 @@ PGSharedMemoryCreate(AnonymousMapping *mapping,
 	void	   *memAddress;
 	PGShmemHeader *hdr;
 	struct stat statbuf;
-	Size		sysvsize, total_reserved;
+	Size		sysvsize;
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -1398,12 +1085,6 @@ PGSharedMemoryCreate(AnonymousMapping *mapping,
 
 	/* Prepare the mapping information */
 	mapping->shmem_size = mapping->shmem_req_size;
-	total_reserved = (Size) MaxAvailableMemory * BLCKSZ;
-	mapping->shmem_reserved = total_reserved * SHMEM_RESIZE_RATIO[mapping->shmem_segment];
-
-	/* Round up to be a multiple of BLCKSZ */
-	mapping->shmem_reserved = mapping->shmem_reserved + BLCKSZ -
-		(mapping->shmem_reserved % BLCKSZ);
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
@@ -1666,29 +1347,6 @@ PGSharedMemoryDetach(void)
 	}
 }
 
-void
-WaitOnShmemBarrier()
-{
-	Barrier *barrier = &ShmemCtrl->Barrier;
-
-	/* Nothing to do if resizing is not started */
-	if (BarrierPhase(barrier) < SHMEM_RESIZE_START)
-		return;
-
-	BarrierAttach(barrier);
-
-	/* Otherwise wait through all available phases */
-	while (BarrierPhase(barrier) < SHMEM_RESIZE_DONE)
-	{
-		ereport(LOG, (errmsg("ProcSignal barrier is in phase %d, waiting",
-							 BarrierPhase(barrier))));
-
-		BarrierArriveAndWait(barrier, 0);
-	}
-
-	BarrierDetach(barrier);
-}
-
 void
 ShmemControlInit(void)
 {
@@ -1700,16 +1358,13 @@ ShmemControlInit(void)
 
 	if (!foundShmemCtrl)
 	{
-		/*
-		 * The barrier is missing here, it will be initialized right before
-		 * starting the resizing process as a convenient way to reset it.
-		 */
-
-		/* Initialize with the currently known value */
-		pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
-
-		/* shmem_resizable should be initialized by now */
-		ShmemCtrl->Resizable = shmem_resizable;
-		ShmemCtrl->evictor_pid = 0;
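+		/*
+		 * Resize bookkeeping (a short summary of the fields, based on the
+		 * resize protocol described in shmem.c): targetNBuffers is the
+		 * requested new size, activeNBuffers is the portion of the pool
+		 * backends may allocate from, and transitNBuffers is the largest
+		 * buffer id a backend may still observe while a resize is in flight.
+		 * The coordinator pid, the postmaster-work flag and its condition
+		 * variable synchronize the coordinating backend with the postmaster.
+		 */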
+		pg_atomic_init_u32(&ShmemCtrl->targetNBuffers, 0);
+		pg_atomic_init_u32(&ShmemCtrl->activeNBuffers, 0);
+		pg_atomic_init_u32(&ShmemCtrl->transitNBuffers, 0);
+		pg_atomic_init_flag(&ShmemCtrl->resize_in_progress);
+
+		ShmemCtrl->coordinator = 0;
+		ShmemCtrl->pmwork_done = false;
+		ConditionVariableInit(&ShmemCtrl->pm_cv);
 	}
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ba9528d5dfa..3be146abac2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -110,9 +110,11 @@
 #include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/aio_subsys.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "tcop/backend_startup.h"
@@ -125,7 +127,6 @@
 
 #ifdef EXEC_BACKEND
 #include "common/file_utils.h"
-#include "storage/pg_shmem.h"
 #endif
 
 
@@ -959,6 +960,11 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	InitializeFastPathLocks();
 
+	/*
+	 * Calculate MaxNBuffers for buffer pool resizing.
+	 */
+	InitializeMaxNBuffers();
+
 	/*
 	 * Give preloaded libraries a chance to request additional shared memory.
 	 */
@@ -1698,9 +1704,6 @@ ServerLoop(void)
 			if (pending_pm_pmsignal)
 				process_pm_pmsignal();
 
-			if (pending_pm_shmem_resize)
-				process_pm_shmem_resize();
-
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
 				ClientSocket s;
@@ -2046,15 +2049,34 @@ process_pm_reload_request(void)
 	}
 }
 
+/*
+ * Handle requests from the coordinator to resize the shared memory mappings so
+ * that newly spawned backends inherit the resized layout.
+ */
 static void
 process_pm_shmem_resize(void)
 {
+	elog(LOG, "postmaster received PMSIGNAL_SHMEM_RESIZE, coordinating memory remapping");
+	
 	/*
-	 * Failure to resize is considered to be fatal and will not be
-	 * retried, which means we can disable pending flag right here.
+	 * Perform the memory remapping in the postmaster process. This should never
+	 * fail, since the address space is always reserved up front. If it does
+	 * fail, the address maps in the backends would become inconsistent, which
+	 * violates a fundamental assumption of the PostgreSQL architecture. Hence
+	 * PANIC.
 	 */
-	pending_pm_shmem_resize = false;
-	CoordinateShmemResize();
+	if (!AnonymousShmemResize())
+		elog(PANIC, "postmaster failed to resize anonymous shared memory");
+	else
+	{
+		int targetNBuffers = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+		elog(LOG, "postmaster successfully completed shared memory remapping");
+
+		BufferManagerShmemValidate(targetNBuffers);
+		elog(LOG, "postmaster successfully validated buffer manager shared memory");
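+		/* Let the coordinating backend know the postmaster's part is done. */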
+		ShmemCtrl->pmwork_done = true;
+		ConditionVariableBroadcast(&ShmemCtrl->pm_cv);
+		NBuffers = targetNBuffers;
+	}
 }
 
 /*
@@ -3878,7 +3900,7 @@ process_pm_pmsignal(void)
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_SHMEM_RESIZE))
-		AnonymousShmemResize();
+		process_pm_shmem_resize();
 
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index be64fa5a136..80a168ec2ce 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -19,6 +19,7 @@
 #include "storage/pg_shmem.h"
 #include "storage/bufmgr.h"
 #include "storage/pg_shmem.h"
+#include "utils/guc.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -63,47 +64,39 @@ CkptSortItem *CkptBufferIds;
 /*
  * Initialize shared buffer pool
  *
- * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend) or during shared-memory resize. Size
- * of data structures initialized here depends on NBuffers, and to be able to
- * change NBuffers without a restart we store each structure into a separate
- * shared memory segment, which could be resized on demand.
- *
- * FirstBufferToInit tells where to start initializing buffers. For
- * initialization it always will be zero, but when resizing shared-memory it
- * indicates the number of already initialized buffers.
- *
+ * This is called once during shared-memory initialization.
+ * TODO: Restore this function to its initial form. This function should see no
+ * change in the buffer resize patches, except perhaps the use of NBuffersPending.
+ *
  * No locks are taking in this function, it is the caller responsibility to
  * make sure only one backend can work with new buffers.
  */
 void
-BufferManagerShmemInit(int FirstBufferToInit)
+BufferManagerShmemInit(void)
 {
 	bool		foundBufs,
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
 	int			i;
-	elog(DEBUG1, "BufferManagerShmemInit from %d to %d",
-				 FirstBufferToInit, NBuffers);
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
 		ShmemInitStructInSegment("Buffer Descriptors",
-						NBuffers * sizeof(BufferDescPadded),
+						NBuffersPending * sizeof(BufferDescPadded),
 						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
 				  ShmemInitStructInSegment("Buffer Blocks",
-								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  NBuffersPending * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
 								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
 		ShmemInitStructInSegment("Buffer IO Condition Variables",
-						NBuffers * sizeof(ConditionVariableMinimallyPadded),
+						NBuffersPending * sizeof(ConditionVariableMinimallyPadded),
 						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
@@ -115,7 +108,7 @@ BufferManagerShmemInit(int FirstBufferToInit)
 	 */
 	CkptBufferIds = (CkptSortItem *)
 		ShmemInitStructInSegment("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						NBuffersPending * sizeof(CkptSortItem), &foundBufCkpt,
 						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
@@ -124,15 +117,14 @@ BufferManagerShmemInit(int FirstBufferToInit)
 		Assert(foundDescs && foundBufs && foundIOCV && foundBufCkpt);
 		/*
 		 * note: this path is only taken in EXEC_BACKEND case when initializing
-		 * shared memory, or in all cases when resizing shared memory.
+		 * shared memory.
 		 */
 	}
 
-#ifndef EXEC_BACKEND
 	/*
 	 * Initialize all the buffer headers.
 	 */
-	for (i = FirstBufferToInit; i < NBuffers; i++)
+	for (i = 0; i < NBuffersPending; i++)
 	{
 		BufferDesc *buf = GetBufferDescriptor(i);
 
@@ -150,21 +142,18 @@ BufferManagerShmemInit(int FirstBufferToInit)
 
 		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
 	}
-#endif
 
 	/*
-	 * Init other shared buffer-management stuff from scratch configuring buffer
-	 * pool the first time. If we are just resizing buffer pool adjust only the
-	 * required structures.
+	 * Init other shared buffer-management stuff.
 	 */
-	if (FirstBufferToInit == 0)
-		StrategyInitialize(!foundDescs);
-	else
-		StrategyReInitialize(FirstBufferToInit);
+	StrategyInitialize(!foundDescs);
 
 	/* Initialize per-backend file flush context */
 	WritebackContextInit(&BackendWritebackContext,
 						 &backend_flush_after);
+
+	/* Declare the size of the current buffer pool. */
+	NBuffers = NBuffersPending;
 }
 
 /*
@@ -175,30 +164,61 @@ BufferManagerShmemInit(int FirstBufferToInit)
  * shared memory segment. The main segment must not allocate anything
  * related to buffers, every other segment will receive part of the
  * data.
+ * 
+ * If set_reserved is true, also sets the shmem_reserved field for each
+ * segment based on MaxNBuffers. This should be true during server startup
+ * but false during buffer pool resizing.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(bool set_reserved)
 {
 	size_t size;
 
 	/* size of buffer descriptors, plus alignment padding */
-	size = add_size(0, mul_size(NBuffers, sizeof(BufferDescPadded)));
+	size = add_size(0, mul_size(NBuffersPending, sizeof(BufferDescPadded)));
 	size = add_size(size, PG_CACHE_LINE_SIZE);
 	Mappings[BUFFER_DESCRIPTORS_SHMEM_SEGMENT].shmem_req_size = size;
+	if (set_reserved)
+	{
+		/* reserved size based on MaxNBuffers */
+		size = add_size(0, mul_size(MaxNBuffers, sizeof(BufferDescPadded)));
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+		Mappings[BUFFER_DESCRIPTORS_SHMEM_SEGMENT].shmem_reserved = size;
+	}
 
 	/* size of data pages, plus alignment padding */
 	size = add_size(0, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	size = add_size(size, mul_size(NBuffersPending, BLCKSZ));
 	Mappings[BUFFERS_SHMEM_SEGMENT].shmem_req_size = size;
+	if (set_reserved)
+	{
+		/* reserved size based on MaxNBuffers */
+		size = add_size(0, PG_IO_ALIGN_SIZE);
+		size = add_size(size, mul_size(MaxNBuffers, BLCKSZ));
+		Mappings[BUFFERS_SHMEM_SEGMENT].shmem_reserved = size;
+	}
 
 	/* size of I/O condition variables, plus alignment padding */
-	size = add_size(0, mul_size(NBuffers,
+	size = add_size(0, mul_size(NBuffersPending,
 								   sizeof(ConditionVariableMinimallyPadded)));
 	size = add_size(size, PG_CACHE_LINE_SIZE);
 	Mappings[BUFFER_IOCV_SHMEM_SEGMENT].shmem_req_size = size;
+	if (set_reserved)
+	{
+		/* reserved size based on MaxNBuffers */
+		size = add_size(0, mul_size(MaxNBuffers,
+									   sizeof(ConditionVariableMinimallyPadded)));
+		size = add_size(size, PG_CACHE_LINE_SIZE);
+		Mappings[BUFFER_IOCV_SHMEM_SEGMENT].shmem_reserved = size;
+	}
 
 	/* size of checkpoint sort array in bufmgr.c */
-	Mappings[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_req_size = mul_size(NBuffers, sizeof(CkptSortItem));
+	Mappings[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_req_size = mul_size(NBuffersPending, sizeof(CkptSortItem));
+	if (set_reserved)
+	{
+		/* reserved size based on MaxNBuffers */
+		Mappings[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_reserved = mul_size(MaxNBuffers, sizeof(CkptSortItem));
+	}
 
 	/* Allocations in the main memory segment, at the end. */
 
@@ -207,3 +227,181 @@ BufferManagerShmemSize(void)
 
 	return size;
 }
+
+/*
+ * Reinitialize shared buffer manager structures when resizing the buffer pool.
+ *
+ * This function is called in the backend which coordinates buffer resizing
+ * operation.
+ *
+ * TODO: Avoid code duplication with BufferManagerShmemInit() and also assess
+ * which functionality in the latter is required in this function.
+ */
+void
+BufferManagerShmemResize(int currentNBuffers, int targetNBuffers)
+{
+	bool found;
+	int			i;
+	void *tmpPtr;
+
+	tmpPtr = (BufferDescPadded *)
+		ShmemUpdateStructInSegment("Buffer Descriptors",
+						targetNBuffers * sizeof(BufferDescPadded),
+						&found, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
+	if (BufferDescriptors != tmpPtr || !found)
+		elog(FATAL, "resizing buffer descriptors failed: expected pointer %p, got %p, found=%d",
+			 BufferDescriptors, tmpPtr, found);
+
+	tmpPtr = (ConditionVariableMinimallyPadded *)
+		ShmemUpdateStructInSegment("Buffer IO Condition Variables",
+						targetNBuffers * sizeof(ConditionVariableMinimallyPadded),
+						&found, BUFFER_IOCV_SHMEM_SEGMENT);
+	if (BufferIOCVArray != tmpPtr || !found)
+		elog(FATAL, "resizing buffer IO condition variables failed: expected pointer %p, got %p, found=%d",
+			 BufferIOCVArray, tmpPtr, found);
+
+	tmpPtr = (CkptSortItem *)
+		ShmemUpdateStructInSegment("Checkpoint BufferIds",
+						targetNBuffers * sizeof(CkptSortItem), &found,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
+	if (CkptBufferIds != tmpPtr || !found)
+		elog(FATAL, "resizing checkpoint buffer IDs failed: expected pointer %p, got %p, found=%d",
+			 CkptBufferIds, tmpPtr, found);
+
+	tmpPtr = (char *)
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemUpdateStructInSegment("Buffer Blocks",
+								  targetNBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &found, BUFFERS_SHMEM_SEGMENT));
+	if (BufferBlocks != tmpPtr || !found)
+		elog(FATAL, "resizing buffer blocks failed: expected pointer %p, got %p, found=%d",
+			 BufferBlocks, tmpPtr, found);
+
+	/*
+	 * Initialize the headers for new buffers. If we are shrinking the
+	 * buffers, currentNBuffers >= targetNBuffers, thus this loop doesn't execute.
+	 */
+	for (i = currentNBuffers; i < targetNBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
+
+		ClearBufferTag(&buf->tag);
+
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+
+		buf->buf_id = i;
+
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
+
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
+	}
+
+	StrategyReset(targetNBuffers);
+}
+
+/*
+ * BufferManagerShmemValidate
+ *		Validate that buffer manager shared memory structures have correct
+ *		pointers and sizes after a resize operation.
+ *
+ * This function is called by backends during ProcessBarrierShmemResizeStruct
+ * to ensure their view of the buffer structures is consistent after memory
+ * remapping.
+ */
+void
+BufferManagerShmemValidate(int targetNBuffers)
+{
+	bool found;
+	void *tmpPtr;
+
+	/* Validate Buffer Descriptors */
+	tmpPtr = (BufferDescPadded *)
+		ShmemInitStructInSegment("Buffer Descriptors",
+						targetNBuffers * sizeof(BufferDescPadded),
+						&found, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
+	if (!found || BufferDescriptors != tmpPtr)
+		elog(FATAL, "validating buffer descriptors failed: expected pointer %p, got %p, found=%d",
+			 BufferDescriptors, tmpPtr, found);
+
+	/* Validate Buffer IO Condition Variables */
+	tmpPtr = (ConditionVariableMinimallyPadded *)
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
+						targetNBuffers * sizeof(ConditionVariableMinimallyPadded),
+						&found, BUFFER_IOCV_SHMEM_SEGMENT);
+	if (!found || BufferIOCVArray != tmpPtr)
+		elog(FATAL, "validating buffer IO condition variables failed: expected pointer %p, got %p, found=%d",
+			 BufferIOCVArray, tmpPtr, found);
+
+	/* Validate Checkpoint BufferIds */
+	tmpPtr = (CkptSortItem *)
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						targetNBuffers * sizeof(CkptSortItem), &found,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
+	if (!found || CkptBufferIds != tmpPtr)
+		elog(FATAL, "validating checkpoint buffer IDs failed: expected pointer %p, got %p, found=%d",
+			 CkptBufferIds, tmpPtr, found);
+
+	/* Validate Buffer Blocks */
+	tmpPtr = (char *)
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemInitStructInSegment("Buffer Blocks",
+								  targetNBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &found, BUFFERS_SHMEM_SEGMENT));
+	if (!found || BufferBlocks != tmpPtr)
+		elog(FATAL, "validating buffer blocks failed: expected pointer %p, got %p, found=%d",
+			 BufferBlocks, tmpPtr, found);
+}
+
+/*
+ * check_shared_buffers
+ *		GUC check_hook for shared_buffers
+ *
+ * When reloading the configuration, shared_buffers should not be set to a value
+ * higher than max_shared_buffers fixed at the boot time.
+ */
+bool
+check_shared_buffers(int *newval, void **extra, GucSource source)
+{
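+	/*
+	 * finalMaxNBuffers is presumably set once MaxNBuffers has been fixed at
+	 * startup; before that, any requested value is accepted.
+	 */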
+	if (finalMaxNBuffers && *newval > MaxNBuffers)
+	{
+		GUC_check_errdetail("\"shared_buffers\" must not exceed \"max_shared_buffers\".");
+		return false;
+	}
+	return true;
+}
+
+/*
+ * show_shared_buffers
+ *		GUC show_hook for shared_buffers
+ *
+ * Shows both current and pending buffer counts with proper unit formatting.
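+ * For example: "128MB" when no resize is pending, or "128MB (pending: 512MB)"
+ * while a new setting has been loaded but not yet applied.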
+ */
+const char *
+show_shared_buffers(void)
+{
+	static char buffer[128];
+	int64 current_value, pending_value;
+	const char *current_unit, *pending_unit;
+
+	if (NBuffers == NBuffersPending)
+	{
+		/* No buffer pool resizing pending. */
+		convert_int_from_base_unit(NBuffers, GUC_UNIT_BLOCKS, &current_value, &current_unit);
+		snprintf(buffer, sizeof(buffer), INT64_FORMAT "%s", current_value, current_unit);
+	}
+	else
+	{
+		/*
+		 * New value for NBuffers is loaded but not applied yet, show both
+		 * current and pending.
+		 */
+		convert_int_from_base_unit(NBuffers, GUC_UNIT_BLOCKS, &current_value, &current_unit);
+		convert_int_from_base_unit(NBuffersPending, GUC_UNIT_BLOCKS, &pending_value, &pending_unit);
+		snprintf(buffer, sizeof(buffer), INT64_FORMAT "%s (pending: " INT64_FORMAT "%s)", 
+				 current_value, current_unit, pending_value, pending_unit);
+	}
+	
+	return buffer;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fdcb5556235..14200a38a0f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3631,12 +3631,12 @@ static float smoothed_alloc = 0;
 static float smoothed_density = 10.0;
 
 void
-BgBufferSyncReset(int NBuffersOld, int NBuffersNew)
+BgBufferSyncReset(int currentNBuffers, int targetNBuffers)
 {
 	saved_info_valid = false;
 #ifdef BGW_DEBUG
 	elog(DEBUG2, "invalidated background writer status after resizing buffers from %d to %d",
-		 NBuffersOld, NBuffersNew);
+		 currentNBuffers, targetNBuffers);
 #endif
 }
 
@@ -3686,8 +3686,11 @@ BgBufferSync(WritebackContext *wb_context)
 	 * valid. If the buffer pool is being expanded, more buffers will become
 	 * available without even this function writing out any. Hence wait till
 	 * buffer resizing finishes i.e. go into hibernation mode.
+	 * 
+	 * TODO: We may not need this synchronization if background worker itself
+	 * becomes the coordinator.
 	 */
-	if (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+	if (!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress))
 		return true;
 
 	/*
@@ -3883,7 +3886,7 @@ BgBufferSync(WritebackContext *wb_context)
 	 * finish.
 	 */
 	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est &&
-			pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) == NBuffers)
+			!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress))
 	{
 		int			sync_state = SyncOneBuffer(next_to_clean, true,
 											   wb_context);
@@ -4255,7 +4258,23 @@ DebugPrintBufferRefcount(Buffer buffer)
 void
 CheckPointBuffers(int flags)
 {
+	/*
+	 * Mark that buffer sync is in progress, delaying any shared memory
+	 * resizing.
+	 *
+	 * TODO: We need to assess whether we should allow a checkpoint and buffer
+	 * resizing to run in parallel. When expanding the buffer pool it may be
+	 * fine to let the checkpointer run during the RESIZE_MAP_AND_MEM phase but
+	 * delay the EXPAND phase until the checkpoint finishes, while not allowing
+	 * a checkpoint to start during the expansion phase. When shrinking, we
+	 * should delay the SHRINK phase until the checkpoint finishes and not
+	 * allow a checkpoint to start until the SHRINK phase is done, but allow it
+	 * to run during the RESIZE_MAP_AND_MEM phase. This needs careful analysis
+	 * and testing.
+	 */
+	delay_shmem_resize = true;
+	
 	BufferSync(flags);
+
+	/* Mark that buffer sync is no longer in progress - allow shared memory resizing */
+	delay_shmem_resize = false;
 }
 
 /*
@@ -7504,10 +7523,12 @@ const PgAioHandleCallbacks aio_local_buffer_readv_cb = {
  * of the shrunk buffer pool.
  */
 bool
-EvictExtraBuffers(int newBufSize, int oldBufSize)
+EvictExtraBuffers(int targetNBuffers, int currentNBuffers)
 {
 	bool result = true;
 
+	Assert(targetNBuffers < currentNBuffers);
+
 	/*
 	 * If the buffer being evicated is locked, this function will need to wait.
 	 * This function should not be called from a Postmaster since it can not wait on a lock.
@@ -7515,77 +7536,50 @@ EvictExtraBuffers(int newBufSize, int oldBufSize)
 	Assert(IsUnderPostmaster);
 
 	/*
-	 * Let only one backend perform eviction. We could split the work across all
-	 * the backends but that doesn't seem necessary.
-	 *
-	 * The first backend to acquire ShmemResizeLock, sets its own PID as the
-	 * evictor PID for other backends to know that the eviction is in progress or
-	 * has already been performed. The evictor backend releases the lock when it
-	 * finishes eviction.  While the eviction is in progress, backends other than
-	 * evictor backend won't be able to take the lock. They won't perform
-	 * eviction. A backend may acquire the lock after eviction has completed, but
-	 * it will not perform eviction since the evictor PID is already set. Evictor
-	 * PID is reset only when the buffer resizing finishes. Thus only one backend
-	 * will perform eviction in a given instance of shared buffers resizing.
-	 *
-	 * Any backend which acquires this lock will release it before the eviction
-	 * phase finishes, hence the same lock can be reused for the next phase of
-	 * resizing buffers.
+	 * TODO: Before evicting any buffer, we should check whether any of the
+	 * buffers are pinned. If we find that a buffer is pinned after evicting
+	 * most of them, that will impact performance since all those evicted
+	 * buffers might need to be read again.
 	 */
-	if (LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+	for (Buffer buf = targetNBuffers + 1; buf <= currentNBuffers; buf++)
 	{
-		if (ShmemCtrl->evictor_pid == 0)
-		{
-			ShmemCtrl->evictor_pid = MyProcPid;
-
-			/*
-			 * TODO: Before evicting any buffer, we should check whether any of the
-			 * buffers are pinned. If we find that a buffer is pinned after evicting
-			 * most of them, that will impact performance since all those evicted
-			 * buffers might need to be read again.
-			 */
-			for (Buffer buf = newBufSize + 1; buf <= oldBufSize; buf++)
-			{
-				BufferDesc *desc = GetBufferDescriptor(buf - 1);
-				uint32		buf_state;
-				bool		buffer_flushed;
+		BufferDesc *desc = GetBufferDescriptor(buf - 1);
+		uint32		buf_state;
+		bool		buffer_flushed;
 
-				buf_state = pg_atomic_read_u32(&desc->state);
+		buf_state = pg_atomic_read_u32(&desc->state);
 
-				/*
-				 * Nobody is expected to touch the buffers while resizing is
-				 * going one hence unlocked precheck should be safe and saves
-				 * some cycles.
-				 */
-				if (!(buf_state & BM_VALID))
-					continue;
+		/*
+		 * Nobody is expected to touch the buffers while resizing is
+		 * going on, hence an unlocked precheck should be safe and saves
+		 * some cycles.
+		 */
+		if (!(buf_state & BM_VALID))
+			continue;
 
-				/*
-				 * XXX: Looks like CurrentResourceOwner can be NULL here, find
-				 * another one in that case?
-				 * */
-				if (CurrentResourceOwner)
-					ResourceOwnerEnlarge(CurrentResourceOwner);
+		/*
+		 * XXX: Looks like CurrentResourceOwner can be NULL here, find
+		 * another one in that case?
+		 */
+		if (CurrentResourceOwner)
+			ResourceOwnerEnlarge(CurrentResourceOwner);
 
-				ReservePrivateRefCountEntry();
+		ReservePrivateRefCountEntry();
 
-				LockBufHdr(desc);
+		LockBufHdr(desc);
 
-				/*
-				 * Now that we have locked buffer descriptor, make sure that the
-				 * buffer without valid data has been skipped above.
-				 */
-				Assert(buf_state & BM_VALID);
+		/*
+		 * Now that we have locked buffer descriptor, make sure that the
+		 * buffer without valid data has been skipped above.
+		 */
+		Assert(buf_state & BM_VALID);
 
-				if (!EvictUnpinnedBufferInternal(desc, &buffer_flushed))
-				{
-					elog(WARNING, "could not remove buffer %u, it is pinned", buf);
-					result = false;
-					break;
-				}
-			}
+		if (!EvictUnpinnedBufferInternal(desc, &buffer_flushed))
+		{
+			elog(WARNING, "could not remove buffer %u, it is pinned", buf);
+			result = false;
+			break;
 		}
-		LWLockRelease(ShmemResizeLock);
 	}
 
 	return result;
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 55be5eebe0a..c09875934d4 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -33,10 +33,16 @@ typedef struct
 	/* Spinlock: protects the values below */
 	slock_t		buffer_strategy_lock;
 
+	/*
+	 * Number of active buffers that can be allocated. During buffer resizing,
+	 * this may be different from NBuffers which tracks the global buffer count.
+	 */
+	pg_atomic_uint32 activeNBuffers;
+
 	/*
 	 * clock-sweep hand: index of next buffer to consider grabbing. Note that
 	 * this isn't a concrete buffer - we only ever increase the value. So, to
-	 * get an actual buffer, it needs to be used modulo NBuffers.
+	 * get an actual buffer, it needs to be used modulo activeNBuffers.
 	 */
 	pg_atomic_uint32 nextVictimBuffer;
 
@@ -101,6 +107,7 @@ static inline uint32
 ClockSweepTick(void)
 {
 	uint32		victim;
+	int			activeBuffers;
 
 	/*
 	 * Atomically move hand ahead one buffer - if there's several processes
@@ -110,12 +117,15 @@ ClockSweepTick(void)
 	victim =
 		pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
 
-	if (victim >= NBuffers)
+	/* Read the current active buffer count atomically */
+	activeBuffers = pg_atomic_read_u32(&StrategyControl->activeNBuffers);
+
+	if (victim >= activeBuffers)
 	{
 		uint32		originalVictim = victim;
 
 		/* always wrap what we look up in BufferDescriptors */
-		victim = victim % NBuffers;
+		victim = victim % activeBuffers;
 
 		/*
 		 * If we're the one that just caused a wraparound, force
@@ -143,7 +153,7 @@ ClockSweepTick(void)
 				 */
 				SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 
-				wrapped = expected % NBuffers;
+				wrapped = expected % activeBuffers;
 
 				success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
 														 &expected, wrapped);
@@ -177,6 +187,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	BufferDesc *buf;
 	int			bgwprocno;
 	int			trycounter;
+	int		activeNBuffers;
 
 	*from_ring = false;
 
@@ -228,7 +239,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
 
 	/* Use the "clock sweep" algorithm to find a free buffer */
-	trycounter = NBuffers;
+	activeNBuffers = pg_atomic_read_u32(&StrategyControl->activeNBuffers);
+	trycounter = activeNBuffers;
+	
 	for (;;)
 	{
 		uint32		old_buf_state;
@@ -280,7 +293,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 				if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
 												   local_buf_state))
 				{
-					trycounter = NBuffers;
+					trycounter = activeNBuffers;
 					break;
 				}
 			}
@@ -323,10 +336,12 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	uint32		nextVictimBuffer;
 	int			result;
+	uint32		activeNBuffers;
 
 	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
-	result = nextVictimBuffer % NBuffers;
+	activeNBuffers = pg_atomic_read_u32(&StrategyControl->activeNBuffers);
+	result = nextVictimBuffer % activeNBuffers;
 
 	if (complete_passes)
 	{
@@ -336,7 +351,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 		 * Additionally add the number of wraparounds that happened before
 		 * completePasses could be incremented. C.f. ClockSweepTick().
 		 */
-		*complete_passes += nextVictimBuffer / NBuffers;
+		*complete_passes += nextVictimBuffer / activeNBuffers;
 	}
 
 	if (num_buf_alloc)
@@ -391,6 +406,31 @@ StrategyShmemSize(void)
 	return size;
 }
 
+void
+StrategyReset(int activeNBuffers)
+{
+	Assert(StrategyControl);
+
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+	
+	/* Update the active buffer count for the strategy */
+	pg_atomic_write_u32(&StrategyControl->activeNBuffers, activeNBuffers);
+	
+	/* Reset the clock-sweep pointer to start from beginning */
+	pg_atomic_write_u32(&StrategyControl->nextVictimBuffer, 0);
+
+	/*
+	 * The statistics are viewed in the context of the number of shared buffers.
+	 * Reset them as the size of the active buffer pool changes.
+	 */
+	StrategyControl->completePasses = 0;
+	pg_atomic_write_u32(&StrategyControl->numBufferAllocs, 0);
+
+	/* TODO: Do we need to reset background writer notifications? */
+	StrategyControl->bgwprocno = -1;
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+}
+
 /*
  * StrategyInitialize -- initialize the buffer cache replacement
  *		strategy.
@@ -416,13 +456,13 @@ StrategyInitialize(bool init)
 	 * directory without rehashing all the entries. Just allocating more entries
 	 * will lead to more contention. Hence we setup the buffer lookup table
 	 * considering the maximum possible size of the buffer pool which is
-	 * MaxAvailableMemory.
+	 * MaxNBuffers.
 	 *
 	 * Additionally BufferAlloc() tries to insert a new entry before deleting the
 	 * old.  In principle this could be happening in each partition concurrently,
 	 * so we need extra NUM_BUFFER_PARTITIONS entries.
 	 */
-	InitBufTable(MaxAvailableMemory + NUM_BUFFER_PARTITIONS);
+	InitBufTable(MaxNBuffers + NUM_BUFFER_PARTITIONS);
 
 	/*
 	 * Get or create the shared strategy control block
@@ -441,6 +481,8 @@ StrategyInitialize(bool init)
 
 		SpinLockInit(&StrategyControl->buffer_strategy_lock);
 
+		/* Initialize the active buffer count */
+		pg_atomic_init_u32(&StrategyControl->activeNBuffers, NBuffersPending);
 		/* Initialize the clock-sweep pointer */
 		pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
 
@@ -455,62 +497,6 @@ StrategyInitialize(bool init)
 		Assert(!init);
 }
 
-/*
- * StrategyReInitialize -- re-initialize the buffer cache replacement
- *		strategy.
- *
- * To be called when resizing buffer manager and only from the coordinator.
- * TODO: Assess the differences between this function and StrategyInitialize().
- */
-void
-StrategyReInitialize(int FirstBufferIdToInit)
-{
-	bool		found;
-
-	/*
-	 * Resizing memory for buffer pools should not affect the address of
-	 * StrategyControl.
-	 */
-	if (StrategyControl != (BufferStrategyControl *)
-		ShmemInitStructInSegment("Buffer Strategy Status",
-						sizeof(BufferStrategyControl),
-						&found, MAIN_SHMEM_SEGMENT))
-		elog(FATAL, "something went wrong while re-initializing the buffer strategy");
-
-	Assert(found);
-
-	/* TODO: Buffer lookup table adjustment: There are two options:
-	 *
-	 * 1. Resize the buffer lookup table to match the new number of buffers. But
-	 * this requires rehashing all the entries in the buffer lookup table with
-	 * the new table size.
-	 *
-	 * 2. Allocate maximum size of the buffer lookup table at the beginning and
-	 * never resize it. This leaves sparse buffer lookup table which is
-	 * inefficient from both memory and time perspective. According to David
-	 * Rowley, the sparse entries in the buffer look up table cause frequent
-	 * cacheline reload which affect performance. If the impact of that
-	 * inefficiency in a benchmark is significant, we will need to consider first
-	 * option.
-	 */
-	/*
-	 * The clock sweep tick pointer might have got invalidated. Reset it as if
-	 * starting a fresh server.
-	 */
-	pg_atomic_write_u32(&StrategyControl->nextVictimBuffer, 0);
-
-	/*
-	 * The old statistics is viewed in the context of the number of shared
-	 * buffers. It does not make sense now that the number of shared buffers
-	 * itself has changed.
-	 */
-	StrategyControl->completePasses = 0;
-	pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
-
-	/* No pending notification */
-	StrategyControl->bgwprocno = -1;
-}
-
 
 /* ----------------------------------------------------------------
  *				Backend-private buffer ring management
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index bd75f06047e..cfd952e621e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -133,7 +133,7 @@ CalculateShmemSize(int *num_semaphores)
 	 * memory segment that it uses in the corresponding AnonymousMappings.
 	 * Consider size required from only the main shared memory segment here.
 	 */
-	size = add_size(size, BufferManagerShmemSize());
+	size = add_size(size, BufferManagerShmemSize(true));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
@@ -187,10 +187,14 @@ CalculateShmemSize(int *num_semaphores)
 	 * shared memory segment.
 	 */
 	Mappings[MAIN_SHMEM_SEGMENT].shmem_req_size = size;
+	Mappings[MAIN_SHMEM_SEGMENT].shmem_reserved = size;
 
 	/* might as well round it off to a multiple of a typical page size */
 	for (int segment = 0; segment < ANON_MAPPINGS; segment++)
+	{
 		Mappings[segment].shmem_req_size = add_size(Mappings[segment].shmem_req_size, 8192 - (Mappings[segment].shmem_req_size % 8192));
+		Mappings[segment].shmem_reserved = add_size(Mappings[segment].shmem_reserved, 8192 - (Mappings[segment].shmem_reserved % 8192));
+	}
 
 	return size;
 }
@@ -341,7 +345,7 @@ CreateOrAttachShmemStructs(void)
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
-	BufferManagerShmemInit(0);
+	BufferManagerShmemInit();
 
 	/*
 	 * Set up lock manager
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 2160d258fa7..0a173f038a3 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -657,9 +657,17 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
-					case PROCSIGNAL_BARRIER_SHMEM_RESIZE:
-						processed = ProcessBarrierShmemResize(
-								&ShmemCtrl->Barrier);
+					case PROCSIGNAL_BARRIER_SHBUF_SHRINK:
+						processed = ProcessBarrierShmemShrink();
+						break;
+					case PROCSIGNAL_BARRIER_SHBUF_RESIZE_MAP_AND_MEM:
+						processed = ProcessBarrierShmemResizeMapAndMem();
+						break;
+					case PROCSIGNAL_BARRIER_SHBUF_EXPAND:
+						processed = ProcessBarrierShmemExpand();
+						break;
+					case PROCSIGNAL_BARRIER_SHBUF_RESIZE_FAILED:
+						processed = ProcessBarrierShmemResizeFailed();
 						break;
 				}
 
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 0f9abf69fd5..9793d27042a 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -69,11 +69,19 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "port/pg_numa.h"
+#include "postmaster/bgwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/injection_point.h"
+#include "utils/wait_event.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
@@ -498,28 +506,15 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already. Verify the structure's size:
-		 * - If it's the same, we've found the expected structure.
-		 * - If it's different, we're resizing the expected structure.
-		 *
-		 * XXX: There is an implicit assumption this can only happen in
-		 * "resizable" segments, where only one shared structure is allowed.
-		 * This has to be implemented more cleanly. Probably we should implement
-		 * ShmemReallocRawInSegment functionality just to adjust the size
-		 * according to alignment, return the allocated size and update the
-		 * mapping offset.
+		 * already. The size had better match the size we are asking for.
 		 */
 		if (result->size != size)
 		{
-			Size delta = size - result->size;
-
-			result->size = size;
-			result->allocated_size = size;
-
-			/* Reflect size change in the shared segment */
-			SpinLockAcquire(Segments[shmem_segment].ShmemLock);
-			Segments[shmem_segment].ShmemSegHdr->freeoffset += delta;
-			SpinLockRelease(Segments[shmem_segment].ShmemLock);
+			LWLockRelease(ShmemIndexLock);
+			ereport(ERROR,
+					(errmsg("ShmemIndex entry size is wrong for data structure"
+							" \"%s\": expected %zu, actual %zu",
+							name, size, result->size)));
 		}
 
 		structPtr = result->location;
@@ -556,6 +551,59 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	return structPtr;
 }
 
+/*
+ * ShmemUpdateStructInSegment -- Update the size of a structure in shared memory.
+ *
+ * This function updates the size of an existing shared memory structure. It
+ * finds the structure in the shmem index and updates its size information while
+ * preserving the existing memory location.
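+ * No check is made that the segment's reserved space (shmem_reserved) can
+ * accommodate the new size; callers are expected to stay within that limit.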
+ *
+ * Returns: pointer to the existing structure location.
+ */
+void *
+ShmemUpdateStructInSegment(const char *name, Size size, bool *foundPtr,
+						   int shmem_segment)
+{
+	ShmemIndexEnt *result;
+	void	   *structPtr;
+	Size delta;
+
+	LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);
+
+	Assert(ShmemIndex);
+
+	/* Look up the structure in the shmem index */
+	result = (ShmemIndexEnt *)
+		hash_search(ShmemIndex, name, HASH_FIND, foundPtr);
+
+	Assert(*foundPtr);
+	Assert(result);
+	Assert(result->shmem_segment == shmem_segment);
+
+	delta = size - result->size;
+	/* Store the existing structure pointer */
+	structPtr = result->location;
+
+	/*
+	 * Update the size information. TODO: Ideally we should implement a
+	 * repalloc-style facility for shared memory that returns the allocated size.
+	 */
+	result->size = size;
+	result->allocated_size = size;
+
+	/* Reflect size change in the shared segment */
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
+	Segments[shmem_segment].ShmemSegHdr->freeoffset += delta;
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
+	LWLockRelease(ShmemIndexLock);
+
+	/* Verify the structure is still in the correct segment */
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
+	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
+
+	return structPtr;
+}
+
 /*
  * Add two Size values, checking for overflow
  */
@@ -892,3 +940,402 @@ pg_get_shmem_segments(PG_FUNCTION_ARGS)
 	return (Datum) 0;
 }
 
+/*
+ * TODO: The function henceforth are related to buffer manager and better be
+ * placed in buffer manager related file.
+ */
+
+/*
+ * Prepare ShmemCtrl for resizing the shared buffer pool.
+ */
+static void
+MarkBufferResizingStart(int targetNBuffers, int currentNBuffers)
+{
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	pg_atomic_write_u32(&ShmemCtrl->currentNBuffers, currentNBuffers);
+	pg_atomic_write_u32(&ShmemCtrl->targetNBuffers, targetNBuffers);
+	pg_atomic_write_u32(&ShmemCtrl->activeNBuffers, Min(targetNBuffers, currentNBuffers));
+	pg_atomic_write_u32(&ShmemCtrl->transitNBuffers, currentNBuffers);
+	ShmemCtrl->coordinator = MyProcPid;
+	ShmemCtrl->pmwork_done = false;
+}
+
+/*
+ * Reset ShmemCtrl after resizing the shared buffer pool is done.
+ */
+static void
+MarkBufferResizingEnd(int NBuffers)
+{	
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	pg_atomic_write_u32(&ShmemCtrl->currentNBuffers, NBuffers);
+	pg_atomic_write_u32(&ShmemCtrl->targetNBuffers, NBuffers);
+	pg_atomic_write_u32(&ShmemCtrl->activeNBuffers, NBuffers);
+	pg_atomic_write_u32(&ShmemCtrl->transitNBuffers, NBuffers);
+	ShmemCtrl->coordinator = -1;
+	ShmemCtrl->pmwork_done = false;
+}
+
+/*
+ * Function which resizes the shared buffer pool according to the current value
+ * of the shared_buffers GUC.
+ *
+ * When resizing, the buffer pool is divided into two portions:
+ *
+ * - active buffer pool, which is the part of buffer pool which remains active
+ * even during resizing. Its size is given by activeNBuffers. Newly allocated
+ * buffers will have their buffer ids less than activeNBuffers.
+ *
+ * - in-transit buffer pool, which is the part of buffer pool which may be
+ * accessible to some backends but not others. When shrinking the buffer pool
+ * this is the part of buffer pool which will be evicted. When expanding the
+ * buffer pool this is the expanded portion. Its size is given by
+ * transitNBuffers. The backends may see buffer ids up to transitNBuffers.
+ *
+ * Before resizing starts, activeNBuffers = transitNBuffers = NBuffers, and
+ * NewNBuffers is the new size of the shared buffer pool.
+ *
+ * In order to synchronize with other running backends, the coordinator sends
+ * following ProcSignalBarriers in the order given below:
+ *
+ * 1. When shrinking the shared buffer pool (with size NBuffers), the coordinator
+ * sends SHBUF_SHRINK ProcSignalBarrier. Every backend sets activeNBuffers =
+ * NewNBuffers to restrict its buffer pool allocations to the new size of the
+ * buffer pool and acknowledges the ProcSignalBarrier. Once every backend has
+ * acknowledged, the coordinator evicts the buffers in the area being shrunk.
+ * Note that transitNBuffers is still NBuffers, so the backends may see buffer
+ * ids up to NBuffers from earlier allocations.
+ *
+ * 2. In both cases, when expanding the buffer pool or shrinking the buffer pool,
+ * the coordinator sends SHBUF_RESIZE_MAP_AND_MEM ProcSignalBarrier. Every
+ * backend is expected to adjust their shared memory segment maps (by calling
+ * AnonymousShmemResize()) and validate that their pointers to the shared buffers
+ * structure are valid and have the right size. When shrinking shared buffer pool
+ * transitNBuffers is set to NewNBuffers and the backends should no more see
+ * buffer ids beyond NewNBuffers. When expanding they should also set
+ * transitNBuffers to NewNBuffers to accomodate backends which may accept the
+ * next barrier earlier than the others. After this the backends should
+ * acknowledge the ProcSignalBarrier.
+ *
+ * 3. When expanding the buffer pool, the coordinator sends SHBUF_EXPAND
+ * ProcSignalBarrier. The backends are expected to set activeNBuffers =
+ * NewNBuffers and start allocating buffers from the expanded range.
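+ *
+ * This function is meant to be called (presumably via a SQL-level wrapper such
+ * as SELECT pg_resize_shared_buffers()) after shared_buffers has been reloaded,
+ * i.e. when NBuffersPending differs from NBuffers.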
+ *
+ * TODO: Find a better place for this function, and a better name, if we find
+ * this interface viable.
+ *
+ * TODO: Should this function be in bufmgr.c?
+ *
+ * TODO: Handle the case when the backend executing this function dies or the
+ * query is cancelled.
+ */
+Datum
+pg_resize_shared_buffers(PG_FUNCTION_ARGS)
+{
+	bool result = true;
+	int currentNBuffers = NBuffers;
+	int targetNBuffers = NBuffersPending;
+
+	if (currentNBuffers == targetNBuffers)
+	{
+		elog(LOG, "shared buffers are already at %d, no need to resize", currentNBuffers);
+		PG_RETURN_BOOL(true);
+	}
+
+	if (!pg_atomic_test_set_flag(&ShmemCtrl->resize_in_progress))
+	{
+		elog(LOG, "shared buffer resizing already in progress");
+		PG_RETURN_BOOL(false);
+	}
+
+	MarkBufferResizingStart(targetNBuffers, currentNBuffers);
+	elog(LOG, "resizing shared buffers from %d to %d", currentNBuffers, targetNBuffers);
+
+	INJECTION_POINT("pg-resize-shared-buffers-flag-set", NULL);
+
+	/* Phase 1: SHBUF_SHRINK - Only for shrinking buffer pool */
+	if (targetNBuffers < currentNBuffers)
+	{
+		/*
+		 * Phase 1: Shrinking - send SHBUF_SHRINK barrier
+		 * Every backend sets activeNBuffers = NewNBuffers to restrict 
+		 * buffer pool allocations to the new size
+		 */
+		elog(LOG, "Phase 1: Shrinking buffer pool, restricting allocations to %d buffers", targetNBuffers);
+		
+		WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHBUF_SHRINK));
+		elog(LOG, "all backends acknowledged shrink phase");
+
+		/* Evict buffers in the area being shrunk */
+		elog(LOG, "evicting buffers %u..%u", targetNBuffers + 1, currentNBuffers);
+		if (!EvictExtraBuffers(targetNBuffers, currentNBuffers))
+		{
+			elog(WARNING, "failed to evict extra buffers during shrinking");
+			WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHBUF_RESIZE_FAILED));
+			MarkBufferResizingEnd(currentNBuffers);
+			pg_atomic_clear_flag(&ShmemCtrl->resize_in_progress);
+			Assert(NBuffers == currentNBuffers);
+			NBuffers = pg_atomic_read_u32(&ShmemCtrl->currentNBuffers);
+			PG_RETURN_BOOL(false);
+		}
+
+		/*
+		 * This backend handles NBuffers itself instead of relying on the
+		 * barrier handler, so that barrier handlers do not interfere with its
+		 * operations.
+		 */
+		NBuffers = targetNBuffers;
+	}
+
+	/* Phase 2: SHBUF_RESIZE_MAP_AND_MEM - Both expanding and shrinking */
+	elog(LOG, "Phase 2: Remapping shared memory segments and updating structures");
+	if (!AnonymousShmemResize())
+	{
+		/*
+		 * This should never fail since address map should already be reserved.
+		 * So the failure should be treated as PANIC.
+		 */
+		elog(PANIC, "failed to resize anonymous shared memory");
+	}
+
+	/*
+	 * When shrinking, no backend should see buffers beyond the active portion
+	 * of the buffer pool. When expanding, update transitNBuffers so backends
+	 * can see the new range.
+	 */
+	pg_atomic_write_u32(&ShmemCtrl->transitNBuffers, targetNBuffers);
+
+	/* Update structure pointers and sizes */
+	BufferManagerShmemResize(currentNBuffers, targetNBuffers);
+
+	/*
+	 * Request the postmaster to remap and resize. TODO: Handle the case when
+	 * the postmaster is not able to remap and resize the shared memory
+	 * structures.
+	 */
+	SendPostmasterSignal(PMSIGNAL_SHMEM_RESIZE);
+	elog(LOG, "waiting for the postmaster to finish remapping and resizing the shared buffers");
+	while (!ShmemCtrl->pmwork_done)
+	{
+		if (ConditionVariableTimedSleep(&ShmemCtrl->pm_cv,
+										5000,
+										WAIT_EVENT_PM_BUFFER_RESIZE_WAIT))
+			ereport(LOG,
+					(errmsg("still waiting for the postmaster PID %d to finish resizing buffers",
+							(int) PostmasterPid)));
+	}
+	ConditionVariableCancelSleep();
+	elog(LOG, "postmaster remapped and resized the shared memory");
+
+	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHBUF_RESIZE_MAP_AND_MEM));
+	elog(LOG, "all backends acknowledged memory remapping and structure updates");
+
+	/* Phase 3: SHBUF_EXPAND - Only for expanding buffer pool */
+	if (targetNBuffers > currentNBuffers)
+	{
+		/*
+		 * Phase 3: Expanding - send SHBUF_EXPAND barrier
+		 * Backends set activeNBuffers = NewNBuffers and start allocating 
+		 * buffers from the expanded range
+		 */
+		elog(LOG, "Phase 3: Expanding buffer pool, enabling allocations up to %d buffers", targetNBuffers);
+		
+		WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHBUF_EXPAND));
+		elog(LOG, "all backends acknowledged expand phase");
+
+		/*
+		 * This backend handles NBuffers itself instead of relying on the
+		 * barrier handler, so that barrier handlers do not interfere with its
+		 * operations.
+		 */
+		NBuffers = targetNBuffers;
+	}
+
+	/*
+	 * Reset buffer resize control area.
+	 */
+	MarkBufferResizingEnd(targetNBuffers);
+
+	pg_atomic_clear_flag(&ShmemCtrl->resize_in_progress);
+
+	elog(LOG, "successfully resized shared buffers to %d", targetNBuffers);
+
+	PG_RETURN_BOOL(result);
+}
+
+bool
+ProcessBarrierShmemShrink(void)
+{
+	int targetNBuffers = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+	int activeNBuffers = pg_atomic_read_u32(&ShmemCtrl->activeNBuffers);
+
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	/*
+	 * The coordinator's work is done in the function which sends the barriers,
+	 * so acknowledge immediately.
+	 */
+	if (ShmemCtrl->coordinator == MyProcPid)
+	{
+		elog(LOG, "Phase 1: Coordinator backend %d acknowledging SHBUF_SHRINK barrier immediately", MyProcPid);
+		return true;
+	}
+
+	/*
+	 * Delay adjusting the new active size of buffer pool till this process
+	 * becomes ready to resize buffers.
+	 */
+	if (delay_shmem_resize)
+	{
+		elog(LOG, "Phase 1: Delaying SHBUF_SHRINK barrier - restricting allocations from %d to %d buffers, coordinator is %d",
+			NBuffers, targetNBuffers, ShmemCtrl->coordinator);
+
+		return false;
+	}
+
+	elog(LOG, "Phase 1: Processing SHBUF_SHRINK barrier - restricting allocations from %d to %d buffers, coordinator is %d",
+			NBuffers, targetNBuffers, ShmemCtrl->coordinator);
+
+	if (MyBackendType == B_BG_WRITER)
+	{
+		/*
+		 * Before resuming regular background writer activity, adjust the
+		 * statistics collected so far.
+		 */
+		BgBufferSyncReset(NBuffers, targetNBuffers);
+		/* Reset strategy control to new size */
+		StrategyReset(targetNBuffers);
+	}
+
+	/* Update local knowledge of activeNBuffers  */
+	NBuffers = activeNBuffers;
+
+	return true;
+}
+
+bool
+ProcessBarrierShmemResizeMapAndMem(void)
+{
+	int targetNBuffers = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+#ifdef USE_ASSERT_CHECKING
+	int activeNBuffers = pg_atomic_read_u32(&ShmemCtrl->activeNBuffers);
+	int transitNBuffers = pg_atomic_read_u32(&ShmemCtrl->transitNBuffers);
+#endif /* USE_ASSERT_CHECKING */
+
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	/*
+	 * The coordinator's work is done in the function which sends the barriers,
+	 * so acknowledge immediately.
+	 */
+	if (ShmemCtrl->coordinator == MyProcPid)
+	{
+		elog(LOG, "Phase 2: Coordinator backend %d acknowledging SHBUF_RESIZE_MAP_AND_MEM barrier immediately", MyProcPid);
+		return true;
+	}
+
+	/*
+	 * If the buffer pool is being shrunk, we are already working with a smaller
+	 * buffer pool, so shrinking the address space and shared structures should
+	 * not be a problem. When expanding, growing the address space and shared
+	 * structures beyond the current boundaries is not a problem either, since
+	 * we are not accessing that memory yet. So there is no reason to delay
+	 * processing this barrier.
+	 */
+
+	elog(LOG, "Phase 2: Processing SHBUF_RESIZE_MAP_AND_MEM barrier - adjusting memory maps and validating structures, coordinator is %d",
+			ShmemCtrl->coordinator);
+
+	/* 
+	 * NBuffers should already be set to activeNBuffers from Phase 1.
+	 * When shrinking, NBuffers should also be same as transitNBuffers in this phase.
+	 */
+	Assert(NBuffers == activeNBuffers);
+	if (targetNBuffers < pg_atomic_read_u32(&ShmemCtrl->currentNBuffers))
+	{
+		/* Shrinking case - verify NBuffers equals transitNBuffers */
+		Assert(NBuffers == transitNBuffers);
+	}
+
+	/* 
+	 * Address space should already be reserved so resizing should not fail. If
+	 * it fails, the address map of this backend may go out of sync with other
+	 * backends. Hence PANIC.
+	 */
+	if (!AnonymousShmemResize())
+		elog(PANIC, "failed to resize anonymous shared memory in backend %d", MyProcPid);
+
+	elog(LOG, "Backend %d successfully remapped shared memory segments for buffer resize", MyProcPid);
+
+	/* 
+	 * Backends validate that their pointers to shared buffer structures are 
+	 * still valid and have the correct size after memory remapping.
+	 */
+	BufferManagerShmemValidate(targetNBuffers);
+	
+	/* 
+	 * TODO: Save new transitNBuffers value in process local memory, if
+	 * necessary.
+	 */
+	elog(LOG, "Backend %d successfully validated structure pointers after resize", MyProcPid);
+
+	return true;
+}
+
+bool
+ProcessBarrierShmemExpand(void)
+{
+	int targetNBuffers = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+#ifdef USE_ASSERT_CHECKING
+	int transitNBuffers = pg_atomic_read_u32(&ShmemCtrl->transitNBuffers);
+#endif /* USE_ASSERT_CHECKING */
+
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	/*
+	 * The coordinator's work is done in the function which sends the barriers,
+	 * so acknowledge immediately.
+	 */
+	if (ShmemCtrl->coordinator == MyProcPid)
+	{
+		elog(LOG, "Phase 3: Coordinator backend %d acknowledging SHBUF_EXPAND barrier immediately", MyProcPid);
+		return true;
+	}
+
+	/*
+	 * Delay adjusting the new active size of buffer pool till this process
+	 * becomes ready to resize buffers.
+	 */
+	if (delay_shmem_resize)
+	{
+		elog(LOG, "Phase 3: Delaying SHBUF_EXPAND barrier - enabling allocations up to %d buffers, coordinator is %d",
+				targetNBuffers, ShmemCtrl->coordinator);
+		return false;
+	}
+
+	elog(LOG, "Phase 3: Processing SHBUF_EXPAND barrier - enabling allocations up to %d buffers, coordinator is %d",
+			targetNBuffers, ShmemCtrl->coordinator);
+
+	if (MyBackendType == B_BG_WRITER)
+	{
+		/*
+		 * Adjust background writer statistics for the expanded buffer pool
+		 */
+		BgBufferSyncReset(NBuffers, targetNBuffers);
+		StrategyReset(targetNBuffers);
+	}
+
+	/* Update local knowledge about the size of active buffer pool. */
+	NBuffers = targetNBuffers;
+
+	/* When expanding, NBuffers should match transitNBuffers set in the previous phase. */
+	Assert(NBuffers == transitNBuffers);
+
+	return true;
+}
+
+bool
+ProcessBarrierShmemResizeFailed(void)
+{
+	int currentNBuffers = pg_atomic_read_u32(&ShmemCtrl->currentNBuffers);
+	int targetNBuffers = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	/* The work to be done by the coordinator is done in the function which sends the barriers. Hence acknowledge immediately. */
+	if (ShmemCtrl->coordinator == MyProcPid)
+	{
+		elog(LOG, "Coordinator backend %d acknowledging SHBUF_RESIZE_FAILED barrier immediately", MyProcPid);
+		return true;
+	}
+
+	elog(LOG, "received proc signal indicating failure to resize shared buffers from %d to %d, restoring to %d, coordinator is %d",
+			NBuffers, targetNBuffers, currentNBuffers, ShmemCtrl->coordinator);
+
+	/* Restore NBuffers to the original value */
+	NBuffers = currentNBuffers;
+
+	return true;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ee9f308379c..b43f1408855 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4129,6 +4129,9 @@ PostgresSingleUserMain(int argc, char *argv[],
 	/* Initialize size of fast-path lock cache. */
 	InitializeFastPathLocks();
 
+	/* Initialize MaxNBuffers for buffer pool resizing. */
+	InitializeMaxNBuffers();
+
 	/*
 	 * Give preloaded libraries a chance to request additional shared memory.
 	 */
@@ -4319,14 +4322,12 @@ PostgresMain(const char *dbname, const char *username)
 	 */
 	BeginReportingGUCOptions();
 
-	/* Verify the shared barrier, if it's still active: join and wait. */
-	WaitOnShmemBarrier();
-
 	/*
-	 * After waiting on the barrier above we guaranteed to have NSharedBuffers
-	 * broadcasted, so we can use it in the function below.
+	 * TODO: The new backend should fetch the shared buffers status. If the
+	 * resizing is going on, it should bring itself up to speed with it. If not,
+	 * simply fetch the latest pointers and sizes. Is this the right place to do
+	 * that?
 	 */
-	AdjustShmemSize();
 
 	/*
 	 * Also set up handler to log session end; we have to wait till now to be
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 9a6a6275305..5794d9522d7 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -155,14 +155,12 @@ REPLICATION_ORIGIN_DROP	"Waiting for a replication origin to become inactive so
 REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
-SHMEM_RESIZE_START	"Waiting for other backends to start resizing shared memory."
-SHMEM_RESIZE_EVICT	"Waiting for other backends to finish buffer evication phase."
-SHMEM_RESIZE_DONE	"Waiting for other backends to finish resizing shared memory."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
 WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for streaming replication."
 WAL_SUMMARY_READY	"Waiting for a new WAL summary to be generated."
 XACT_GROUP_UPDATE	"Waiting for the group leader to update transaction status at transaction end."
+PM_BUFFER_RESIZE_WAIT	"Waiting for the postmaster to complete shared buffer pool resize operations."
 
 ABI_compatibility:
 
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 90d3feb547c..894a04caf0f 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -139,8 +139,10 @@ int			max_parallel_maintenance_workers = 2;
  * MaxBackends is computed by PostmasterMain after modules have had a chance to
  * register background workers.
  */
-int			NBuffers = 16384;
-int			MaxAvailableMemory = 524288;
+int			NBuffers = 0;
+int			NBuffersPending = 16384;
+bool finalMaxNBuffers = false;
+int			MaxNBuffers = 0;
 int			MaxConnections = 100;
 int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 641e535a73c..e0401fb6477 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -599,6 +599,38 @@ InitializeFastPathLocks(void)
 		   pg_nextpower2_32(FastPathLockGroupsPerBackend));
 }
 
+/*
+ * Initialize MaxNBuffers variable with validation.
+ *
+ * This must be called after GUCs have been loaded but before shared memory size
+ * is determined.
+ *
+ * Since MaxNBuffers limits the size of the buffer pool, it must be at least as
+ * much as NBuffersPending. If MaxNBuffers is 0 (default), set it to
+ * NBuffersPending. Otherwise, validate that MaxNBuffers is not less than
+ * NBuffersPending.
+ */
+void
+InitializeMaxNBuffers(void)
+{
+	if (MaxNBuffers == 0)  /* default/boot value */
+		MaxNBuffers = NBuffersPending;
+	else
+	{
+		if (MaxNBuffers < NBuffersPending)
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("max_shared_buffers (%d) cannot be less than current shared_buffers (%d)",
+							MaxNBuffers, NBuffersPending),
+					 errhint("Increase max_shared_buffers or decrease shared_buffers.")));
+		}
+	}
+	
+	Assert(!finalMaxNBuffers);
+	finalMaxNBuffers = true;
+}
+
 /*
  * Early initialization of a backend (either standalone or under postmaster).
  * This happens even before InitPostgres.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8794e26ef1d..71a09a65182 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2731,7 +2731,7 @@ convert_to_base_unit(double value, const char *unit,
  * the value without loss.  For example, if the base unit is GUC_UNIT_KB, 1024
  * is converted to 1 MB, but 1025 is represented as 1025 kB.
  */
-static void
+void
 convert_int_from_base_unit(int64 base_value, int base_unit,
 						   int64 *value, const char **unit)
 {
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 7b3ac5f3716..262f42c06c3 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1108,25 +1108,23 @@
 { name => 'shared_buffers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_MEM',
   short_desc => 'Sets the number of shared memory buffers used by the server.',
   flags => 'GUC_UNIT_BLOCKS',
-  variable => 'NBuffers',
+  variable => 'NBuffersPending',
   boot_val => '16384',
   min => '16',
   max => 'INT_MAX / 2',
-  assign_hook => 'assign_shared_buffers'
+  check_hook => 'check_shared_buffers',
+  show_hook => 'show_shared_buffers',
 },
 
-# TODO: should this be PGC_POSTMASTER?
-{ name => "max_available_memory", type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_MEM',
+{ name => "max_shared_buffers", type => 'int', context => 'PGC_POSTMASTER', group => 'RESOURCES_MEM',
   short_desc => 'Sets the upper limit for the shared_buffers value.',
-  long_desc => 'Shared memory could be resized at runtime, this parameters sets the upper limit for it, beyond which resizing would not be supported. Normally this value would be the same as the total available memory.',
   flags => 'GUC_UNIT_BLOCKS',
-  variable => 'MaxAvailableMemory',
-  boot_val => '524288',
-  min => '16',
+  variable => 'MaxNBuffers',
+  boot_val => '0',
+  min => '0',
   max => 'INT_MAX / 2',
 },
 
-
 { name => 'vacuum_buffer_usage_limit', type => 'int', context => 'PGC_USERSET', group => 'RESOURCES_MEM',
   short_desc => 'Sets the buffer pool size for VACUUM, ANALYZE, and autovacuum.',
   flags => 'GUC_UNIT_KB',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 8f1d0b7c031..ce5110b8636 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12620,4 +12620,10 @@
   proargnames => '{pid,io_id,io_generation,state,operation,off,length,target,handle_data_len,raw_result,result,target_desc,f_sync,f_localmem,f_buffered}',
   prosrc => 'pg_get_aios' },
 
+{ oid => '9999', descr => 'resize shared buffers according to the value of GUC `shared_buffers`',
+  proname => 'pg_resize_shared_buffers',
+  provolatile => 'v',
+  prorettype => 'bool',
+  proargtypes => '',
+  prosrc => 'pg_resize_shared_buffers'},
 ]
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index a0c37a7749e..efe3d3c73ff 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -172,8 +172,11 @@ extern PGDLLIMPORT bool ExitOnAnyError;
 extern PGDLLIMPORT char *DataDir;
 extern PGDLLIMPORT int data_directory_mode;
 
+/* TODO: This is no longer a GUC variable; should be moved somewhere else. */
 extern PGDLLIMPORT int NBuffers;
-extern PGDLLIMPORT int MaxAvailableMemory;
+extern PGDLLIMPORT int NBuffersPending;
+extern PGDLLIMPORT bool finalMaxNBuffers;
+extern PGDLLIMPORT int MaxNBuffers;
 extern PGDLLIMPORT int MaxBackends;
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
@@ -502,6 +505,7 @@ extern PGDLLIMPORT ProcessingMode Mode;
 extern void pg_split_opts(char **argv, int *argcp, const char *optstr);
 extern void InitializeMaxBackends(void);
 extern void InitializeFastPathLocks(void);
+extern void InitializeMaxNBuffers(void);
 extern void InitPostgres(const char *in_dbname, Oid dboid,
 						 const char *username, Oid useroid,
 						 bits32 flags,
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 20bea8132fd..bbb7a225216 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -447,7 +447,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
-extern void StrategyReInitialize(int FirstBufferToInit);
+extern void StrategyReset(int activeNBuffers);
 
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 74e226269af..6866d09dc22 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -20,6 +20,7 @@
 #include "storage/buf.h"
 #include "storage/bufpage.h"
 #include "storage/relfilelocator.h"
+#include "utils/guc.h"
 #include "utils/relcache.h"
 #include "utils/snapmgr.h"
 
@@ -151,6 +152,7 @@ typedef struct WritebackContext WritebackContext;
 
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
+extern PGDLLIMPORT int NBuffersPending;
 
 /* in bufmgr.c */
 extern PGDLLIMPORT bool zero_damaged_pages;
@@ -197,6 +199,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * prototypes for functions in buf_init.c
+ */
+extern const char *show_shared_buffers(void);
+extern bool check_shared_buffers(int *newval, void **extra, GucSource source);
 
 /*
  * prototypes for functions in bufmgr.c
@@ -300,7 +307,7 @@ extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(WritebackContext *wb_context);
-extern void BgBufferSyncReset(int NBuffersOld, int NBuffersNew);
+extern void BgBufferSyncReset(int currentNBuffers, int targetNBuffers);
 
 extern uint32 GetPinLimit(void);
 extern uint32 GetLocalPinLimit(void);
@@ -317,11 +324,13 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_evicted,
 									int32 *buffers_flushed,
 									int32 *buffers_skipped);
-extern bool EvictExtraBuffers(int fromBuf, int toBuf);
+extern bool EvictExtraBuffers(int targetNBuffers, int currentNBuffers);
 
 /* in buf_init.c */
-extern void BufferManagerShmemInit(int);
-extern Size BufferManagerShmemSize(void);
+extern void BufferManagerShmemInit(void);
+extern Size BufferManagerShmemSize(bool set_reserved);
+extern void BufferManagerShmemResize(int currentNBuffers, int targetNBuffers);
+extern void BufferManagerShmemValidate(int targetNBuffers);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 6e7b0abb625..10e74b34813 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -64,7 +64,6 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
-extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
 extern PGDLLIMPORT volatile bool delay_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 704b065f9e9..34b5e6c48ca 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,9 +24,11 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "port/atomics.h"
 #include "storage/barrier.h"
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
+#include "utils/guc.h"
 
 typedef struct AnonymousMapping
 {
@@ -73,14 +75,20 @@ extern PGDLLIMPORT AnonymousMapping Mappings[ANON_MAPPINGS];
 /*
  * ShmemControl is shared between backends and helps to coordinate shared
  * memory resize.
+ * 
+ * TODO: I think we need a lock to protect this structure. If we do so, do we
+ * need to use atomic integers?
  */
 typedef struct
 {
-	pg_atomic_uint32 	NSharedBuffers;
-	pid_t				evictor_pid;
-	Barrier 			Barrier;
-	pg_atomic_uint64 	Generation;
-	bool                Resizable;
+	pg_atomic_flag		resize_in_progress; /* true if resizing is in progress. false otherwise. */
+	pg_atomic_uint32	currentNBuffers; /* Original NBuffers value before resize started */
+	pg_atomic_uint32	targetNBuffers;
+	pg_atomic_uint32	activeNBuffers; /* Active portion of buffer pool during resizing. */
+	pg_atomic_uint32	transitNBuffers; /* Part of the buffer pool beyond activeNBuffers which may remain accessible during resizing. */
+	pid_t				coordinator;
+	ConditionVariable pm_cv; /* Coordinator waits for PM to complete its work using this CV. */
+	bool pmwork_done; /* PM has completed its work of resizing buffers. */
 } ShmemControl;
 
 extern PGDLLIMPORT ShmemControl *ShmemCtrl;
@@ -95,7 +103,8 @@ extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
 extern PGDLLIMPORT int huge_pages_status;
-extern PGDLLIMPORT int MaxAvailableMemory;
+extern PGDLLIMPORT bool finalMaxNBuffers;
+extern PGDLLIMPORT int MaxNBuffers;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
@@ -145,7 +154,8 @@ extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 void PrepareHugePages(void);
 
 bool ProcessBarrierShmemResize(Barrier *barrier);
-void assign_shared_buffers(int newval, void *extra, bool *pending);
+const char *show_shared_buffers(void);
+bool check_shared_buffers(int *newval, void **extra, GucSource source);
 void AdjustShmemSize(void);
 extern void WaitOnShmemBarrier(void);
 extern void ShmemControlInit(void);
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 97033f84dce..b80b05f2804 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,7 +54,10 @@ typedef enum
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
-	PROCSIGNAL_BARRIER_SHMEM_RESIZE, /* ask backends to resize shared memory */
+	PROCSIGNAL_BARRIER_SHBUF_SHRINK, /* shrink buffer pool - restrict allocations to new size */
+	PROCSIGNAL_BARRIER_SHBUF_RESIZE_MAP_AND_MEM, /* remap shared memory segments and update structure pointers */
+	PROCSIGNAL_BARRIER_SHBUF_EXPAND, /* expand buffer pool - enable allocations in new range */
+	PROCSIGNAL_BARRIER_SHBUF_RESIZE_FAILED, /* signal backends that the shared buffer resizing failed. */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 64ff5a286ba..6944560d485 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -50,11 +50,19 @@ extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
 extern void *ShmemInitStructInSegment(const char *name, Size size,
 									  bool *foundPtr, int shmem_segment);
+extern void *ShmemUpdateStructInSegment(const char *name, Size size,
+										bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 
 extern PGDLLIMPORT Size pg_get_shmem_pagesize(void);
 
+extern bool ProcessBarrierShmemShrink(void);
+extern bool ProcessBarrierShmemResizeMapAndMem(void);
+extern bool ProcessBarrierShmemExpand(void);
+extern bool ProcessBarrierShmemResizeFailed(void);
+
+
 /* ipci.c */
 extern void RequestAddinShmemSpace(Size size);
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f21ec37da89..08a84373fb7 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -459,6 +459,8 @@ extern config_handle *get_config_handle(const char *name);
 extern void AlterSystemSetConfigFile(AlterSystemStmt *altersysstmt);
 extern char *GetConfigOptionByName(const char *name, const char **varname,
 								   bool missing_ok);
+extern void convert_int_from_base_unit(int64 base_value, int base_unit,
+									   int64 *value, const char **unit);
 
 extern void TransformGUCArray(ArrayType *array, List **names,
 							  List **values);
diff --git a/src/test/buffermgr/Makefile b/src/test/buffermgr/Makefile
index 97c3da9e20a..eb275027fa6 100644
--- a/src/test/buffermgr/Makefile
+++ b/src/test/buffermgr/Makefile
@@ -13,6 +13,9 @@ EXTRA_INSTALL = contrib/pg_buffercache
 
 REGRESS = buffer_resize
 
+# Custom configuration for buffer manager tests
+TEMP_CONFIG = $(srcdir)/buffermgr_test.conf
+
 subdir = src/test/buffermgr
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/buffermgr/buffermgr_test.conf b/src/test/buffermgr/buffermgr_test.conf
new file mode 100644
index 00000000000..21ccf66d9c7
--- /dev/null
+++ b/src/test/buffermgr/buffermgr_test.conf
@@ -0,0 +1,9 @@
+# Configuration for buffer manager regression tests
+
+# Even if max_shared_buffers is set multiple times, only the last one is used
+# as the limit on shared_buffers.
+max_shared_buffers = 128kB
+# Set initial shared_buffers as expected by test
+shared_buffers = 128MB
+# Set a larger value for max_shared_buffers to allow testing resize operations
+max_shared_buffers = 300MB 
\ No newline at end of file
diff --git a/src/test/buffermgr/expected/buffer_resize.out b/src/test/buffermgr/expected/buffer_resize.out
index a986be9a5da..d5cb9d78437 100644
--- a/src/test/buffermgr/expected/buffer_resize.out
+++ b/src/test/buffermgr/expected/buffer_resize.out
@@ -1,9 +1,8 @@
 -- Test buffer pool resizing and shared memory allocation tracking
 -- This test resizes the buffer pool multiple times and monitors
 -- shared memory allocations related to buffer management
--- Create a separate schema for this test
-CREATE SCHEMA buffer_resize_test;
-SET search_path TO buffer_resize_test, public;
+-- TODO: The test sets shared_buffers values in MBs. Instead it could use values
+-- in kBs so that the test runs on very small machines.
 -- Create a view for buffer-related shared memory allocations
 CREATE VIEW buffer_allocations AS
 SELECT name, segment, size, allocated_size 
@@ -28,6 +27,49 @@ SHOW shared_buffers;
  128MB
 (1 row)
 
+SHOW max_shared_buffers;
+ max_shared_buffers 
+--------------------
+ 300MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 134221824 |      134221824
+ Buffer Descriptors            | descriptors |   1048576 |        1048576
+ Buffer IO Condition Variables | iocv        |    262144 |         262144
+ Checkpoint BufferIds          | checkpoint  |    327680 |         327680
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 134225920 |    134225920 |             314580992
+ checkpoint  |    335872 |       335872 |                770048
+ descriptors |   1056768 |      1056768 |               2465792
+ iocv        |    270336 |       270336 |                622592
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        16384
+(1 row)
+
+-- Calling pg_resize_shared_buffers() without changing shared_buffers should be a no-op.
+SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers 
+--------------------------
+ t
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 128MB
+(1 row)
+
 SELECT * FROM buffer_allocations;
              name              |   segment   |   size    | allocated_size 
 -------------------------------+-------------+-----------+----------------
@@ -40,10 +82,10 @@ SELECT * FROM buffer_allocations;
 SELECT * FROM buffer_segments;
     name     |   size    | mapping_size | mapping_reserved_size 
 -------------+-----------+--------------+-----------------------
- buffers     | 134225920 |    134225920 |            2576982016
- checkpoint  |    335872 |       335872 |             214753280
- descriptors |   1056768 |      1056768 |             429498368
- iocv        |    270336 |       270336 |             429498368
+ buffers     | 134225920 |    134225920 |             314580992
+ checkpoint  |    335872 |       335872 |                770048
+ descriptors |   1056768 |      1056768 |               2465792
+ iocv        |    270336 |       270336 |                622592
 (4 rows)
 
 SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
@@ -60,10 +102,18 @@ SELECT pg_reload_conf();
  t
 (1 row)
 
-SELECT pg_sleep(1);
- pg_sleep 
-----------
- 
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+    shared_buffers     
+-----------------------
+ 128MB (pending: 64MB)
+(1 row)
+
+SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers 
+--------------------------
+ t
 (1 row)
 
 SHOW shared_buffers;
@@ -84,10 +134,10 @@ SELECT * FROM buffer_allocations;
 SELECT * FROM buffer_segments;
     name     |   size   | mapping_size | mapping_reserved_size 
 -------------+----------+--------------+-----------------------
- buffers     | 67117056 |     67117056 |            2576982016
- checkpoint  |   172032 |       172032 |             214753280
- descriptors |   532480 |       532480 |             429498368
- iocv        |   139264 |       139264 |             429498368
+ buffers     | 67117056 |     67117056 |             314580992
+ checkpoint  |   172032 |       172032 |                770048
+ descriptors |   532480 |       532480 |               2465792
+ iocv        |   139264 |       139264 |                622592
 (4 rows)
 
 SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
@@ -104,10 +154,18 @@ SELECT pg_reload_conf();
  t
 (1 row)
 
-SELECT pg_sleep(1);
- pg_sleep 
-----------
- 
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+    shared_buffers     
+-----------------------
+ 64MB (pending: 256MB)
+(1 row)
+
+SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers 
+--------------------------
+ t
 (1 row)
 
 SHOW shared_buffers;
@@ -128,10 +186,10 @@ SELECT * FROM buffer_allocations;
 SELECT * FROM buffer_segments;
     name     |   size    | mapping_size | mapping_reserved_size 
 -------------+-----------+--------------+-----------------------
- buffers     | 268443648 |    268443648 |            2576982016
- checkpoint  |    663552 |       663552 |             214753280
- descriptors |   2105344 |      2105344 |             429498368
- iocv        |    532480 |       532480 |             429498368
+ buffers     | 268443648 |    268443648 |             314580992
+ checkpoint  |    663552 |       663552 |                770048
+ descriptors |   2105344 |      2105344 |               2465792
+ iocv        |    532480 |       532480 |                622592
 (4 rows)
 
 SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
@@ -148,10 +206,18 @@ SELECT pg_reload_conf();
  t
 (1 row)
 
-SELECT pg_sleep(1);
- pg_sleep 
-----------
- 
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+     shared_buffers     
+------------------------
+ 256MB (pending: 100MB)
+(1 row)
+
+SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers 
+--------------------------
+ t
 (1 row)
 
 SHOW shared_buffers;
@@ -172,10 +238,10 @@ SELECT * FROM buffer_allocations;
 SELECT * FROM buffer_segments;
     name     |   size    | mapping_size | mapping_reserved_size 
 -------------+-----------+--------------+-----------------------
- buffers     | 104865792 |    104865792 |            2576982016
- checkpoint  |    262144 |       262144 |             214753280
- descriptors |    827392 |       827392 |             429498368
- iocv        |    212992 |       212992 |             429498368
+ buffers     | 104865792 |    104865792 |             314580992
+ checkpoint  |    262144 |       262144 |                770048
+ descriptors |    827392 |       827392 |               2465792
+ iocv        |    212992 |       212992 |                622592
 (4 rows)
 
 SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
@@ -192,10 +258,18 @@ SELECT pg_reload_conf();
  t
 (1 row)
 
-SELECT pg_sleep(1);
- pg_sleep 
-----------
- 
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+     shared_buffers     
+------------------------
+ 100MB (pending: 128kB)
+(1 row)
+
+SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers 
+--------------------------
+ t
 (1 row)
 
 SHOW shared_buffers;
@@ -216,10 +290,10 @@ SELECT * FROM buffer_allocations;
 SELECT * FROM buffer_segments;
     name     |  size  | mapping_size | mapping_reserved_size 
 -------------+--------+--------------+-----------------------
- buffers     | 139264 |       139264 |            2576982016
- checkpoint  |   8192 |         8192 |             214753280
- descriptors |   8192 |         8192 |             429498368
- iocv        |   8192 |         8192 |             429498368
+ buffers     | 139264 |       139264 |             314580992
+ checkpoint  |   8192 |         8192 |                770048
+ descriptors |   8192 |         8192 |               2465792
+ iocv        |   8192 |         8192 |                622592
 (4 rows)
 
 SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
@@ -228,10 +302,28 @@ SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
            16
 (1 row)
 
--- Clean up the schema and all its objects
-RESET search_path;
-DROP SCHEMA buffer_resize_test CASCADE;
-NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to view buffer_resize_test.buffer_allocations
-drop cascades to view buffer_resize_test.buffer_segments
-drop cascades to extension pg_buffercache
+-- Test 6: Try to set shared_buffers higher than max_shared_buffers (should fail)
+ALTER SYSTEM SET shared_buffers = '400MB';
+ERROR:  invalid value for parameter "shared_buffers": 51200
+DETAIL:  "shared_buffers" must be less than "max_shared_buffers".
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+-- reconnect to ensure new setting is loaded
+\c
+-- This should show the old value since the configuration was rejected
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 128kB
+(1 row)
+
+SHOW max_shared_buffers;
+ max_shared_buffers 
+--------------------
+ 300MB
+(1 row)
+
diff --git a/src/test/buffermgr/meson.build b/src/test/buffermgr/meson.build
index e71dcdea685..561630e846f 100644
--- a/src/test/buffermgr/meson.build
+++ b/src/test/buffermgr/meson.build
@@ -8,10 +8,15 @@ tests += {
     'sql': [
       'buffer_resize',
     ],
+    'regress_args': ['--temp-config', files('buffermgr_test.conf')],
   },
   'tap': {
+    'env': {
+      'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
+    },
     'tests': [
       't/001_resize_buffer.pl',
+      't/003_parallel_resize_buffer.pl',
     ],
   },
 }
diff --git a/src/test/buffermgr/sql/buffer_resize.sql b/src/test/buffermgr/sql/buffer_resize.sql
index 45f5bb6d78b..dfaaeabfcbb 100644
--- a/src/test/buffermgr/sql/buffer_resize.sql
+++ b/src/test/buffermgr/sql/buffer_resize.sql
@@ -1,10 +1,8 @@
 -- Test buffer pool resizing and shared memory allocation tracking
 -- This test resizes the buffer pool multiple times and monitors
 -- shared memory allocations related to buffer management
-
--- Create a separate schema for this test
-CREATE SCHEMA buffer_resize_test;
-SET search_path TO buffer_resize_test, public;
+-- TODO: The test sets shared_buffers values in MBs. Instead it could use values
+-- in kBs so that the test runs on very small machines.
 
 -- Create a view for buffer-related shared memory allocations
 CREATE VIEW buffer_allocations AS
@@ -28,6 +26,13 @@ CREATE EXTENSION IF NOT EXISTS pg_buffercache;
 
 -- Test 1: Default shared_buffers 
 SHOW shared_buffers;
+SHOW max_shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+-- Calling pg_resize_shared_buffers() without changing shared_buffers should be a no-op.
+SELECT pg_resize_shared_buffers();
+SHOW shared_buffers;
 SELECT * FROM buffer_allocations;
 SELECT * FROM buffer_segments;
 SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
@@ -35,7 +40,10 @@ SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
 -- Test 2: Set to 64MB  
 ALTER SYSTEM SET shared_buffers = '64MB';
 SELECT pg_reload_conf();
-SELECT pg_sleep(1);
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+SELECT pg_resize_shared_buffers();
 SHOW shared_buffers;
 SELECT * FROM buffer_allocations;
 SELECT * FROM buffer_segments;
@@ -44,7 +52,10 @@ SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
 -- Test 3: Set to 256MB
 ALTER SYSTEM SET shared_buffers = '256MB';
 SELECT pg_reload_conf();
-SELECT pg_sleep(1);
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+SELECT pg_resize_shared_buffers();
 SHOW shared_buffers;
 SELECT * FROM buffer_allocations;
 SELECT * FROM buffer_segments;
@@ -53,7 +64,10 @@ SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
 -- Test 4: Set to 100MB (non-power-of-two)
 ALTER SYSTEM SET shared_buffers = '100MB';
 SELECT pg_reload_conf();
-SELECT pg_sleep(1);
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+SELECT pg_resize_shared_buffers();
 SHOW shared_buffers;
 SELECT * FROM buffer_allocations;
 SELECT * FROM buffer_segments;
@@ -62,12 +76,20 @@ SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
 -- Test 5: Set to minimum 128kB
 ALTER SYSTEM SET shared_buffers = '128kB';
 SELECT pg_reload_conf();
-SELECT pg_sleep(1);
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+SELECT pg_resize_shared_buffers();
 SHOW shared_buffers;
 SELECT * FROM buffer_allocations;
 SELECT * FROM buffer_segments;
 SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
 
--- Clean up the schema and all its objects
-RESET search_path;
-DROP SCHEMA buffer_resize_test CASCADE;
+-- Test 6: Try to set shared_buffers higher than max_shared_buffers (should fail)
+ALTER SYSTEM SET shared_buffers = '400MB';
+SELECT pg_reload_conf();
+-- reconnect to ensure new setting is loaded
+\c
+-- This should show the old value since the configuration was rejected
+SHOW shared_buffers;
+SHOW max_shared_buffers;
diff --git a/src/test/buffermgr/t/001_resize_buffer.pl b/src/test/buffermgr/t/001_resize_buffer.pl
index 8cf9e4539ab..7b0a78ebe5b 100644
--- a/src/test/buffermgr/t/001_resize_buffer.pl
+++ b/src/test/buffermgr/t/001_resize_buffer.pl
@@ -14,40 +14,26 @@ sub apply_and_verify_buffer_change
 {
 	my ($node, $new_size) = @_;
 	
-	# Use a single background_psql session for consistency
-	my $psql_session = $node->background_psql('postgres');
-	$psql_session->query_safe("ALTER SYSTEM SET shared_buffers = '$new_size'");
-	$psql_session->query_safe("SELECT pg_reload_conf()");
-	
-	# Wait till the resizing finishes using the same session
-	# 
-	# TODO: Right now there is no way to know when the resize has finished and
-	# all the backends are using new value of shared_buffers. Hence we poll
-	# manually until we get the expected value in the same session.
-	my $current_size;
-	my $attempts = 0;
-	my $max_attempts = 60; # 60 seconds timeout
-	do {
-		$current_size = $psql_session->query_safe("SHOW shared_buffers");
-		$attempts++;
-		
-		# Only sleep if we didn't get the expected result and haven't timed out yet
-		if ($current_size ne $new_size && $attempts < $max_attempts) {
-			sleep(1);
-		}
-	} while ($current_size ne $new_size && $attempts < $max_attempts);
-	
-	$psql_session->quit;
-	
-	# Check if we succeeded or timed out
-	if ($current_size ne $new_size) {
-		die "Timeout waiting for shared_buffers to change to $new_size (got $current_size after ${attempts}s)";
-	}
+	# Use the new pg_resize_shared_buffers() interface which handles everything synchronously
+	$node->safe_psql('postgres', "ALTER SYSTEM SET shared_buffers = '$new_size'");
+	$node->safe_psql('postgres', "SELECT pg_reload_conf()");
+	# Call the resize function - it returns when the operation is complete
+	is($node->safe_psql('postgres', "SELECT pg_resize_shared_buffers()"), 't',
+		'resizing to ' . $new_size . ' succeeded');
+	is($node->safe_psql('postgres', "SHOW shared_buffers"), $new_size,
+		'SHOW after resizing to '. $new_size . ' succeeded');
 }
 
 # Initialize a cluster and start pgbench in the background for concurrent load.
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
+
+# Permit resizing up to 1GB for this test and let the server start with 128MB.
+$node->append_conf('postgresql.conf', qq{
+max_shared_buffers = 1GB
+shared_buffers = 128MB
+});
+
 $node->start;
 $node->safe_psql('postgres', "CREATE EXTENSION pg_buffercache");
 my $pgb_scale = 10;
diff --git a/src/test/buffermgr/t/003_parallel_resize_buffer.pl b/src/test/buffermgr/t/003_parallel_resize_buffer.pl
new file mode 100644
index 00000000000..9cbb5452fd2
--- /dev/null
+++ b/src/test/buffermgr/t/003_parallel_resize_buffer.pl
@@ -0,0 +1,71 @@
+# Copyright (c) 2025-2025, PostgreSQL Global Development Group
+#
+# Test that only one pg_resize_shared_buffers() call succeeds when multiple
+# sessions attempt to resize buffers concurrently
+
+use strict;
+use warnings;
+use IPC::Run;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Skip this test if injection points are not supported
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize a cluster
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('postgresql.conf', 'shared_preload_libraries = injection_points');
+$node->append_conf('postgresql.conf', 'shared_buffers = 128kB');
+$node->append_conf('postgresql.conf', 'max_shared_buffers = 256kB');
+$node->start;
+
+# Load injection points extension for test coordination
+$node->safe_psql('postgres', "CREATE EXTENSION injection_points");
+
+# Test 1: Two concurrent pg_resize_shared_buffers() calls
+# Set up injection point to pause the first resize call
+$node->safe_psql('postgres', 
+	"SELECT injection_points_attach('pg-resize-shared-buffers-flag-set', 'wait')");
+
+# Change shared_buffers for the resize operation
+$node->safe_psql('postgres', "ALTER SYSTEM SET shared_buffers = '144kB'");
+$node->safe_psql('postgres', "SELECT pg_reload_conf()");
+
+# Start first resize session (will pause at injection point)
+my $session1 = $node->background_psql('postgres');
+$session1->query_until(
+	qr/starting_resize/,
+	q(
+		\echo starting_resize
+		SELECT pg_resize_shared_buffers();
+	)
+);
+
+# Wait until session actually reaches the injection point
+$node->wait_for_event('client backend', 'pg-resize-shared-buffers-flag-set');
+
+# Start second resize session (should fail immediately since resize is in progress)
+my $result2 = $node->safe_psql('postgres', "SELECT pg_resize_shared_buffers()");
+
+# The second call should return false (already in progress)
+is($result2, 'f', 'Second concurrent resize call returns false');
+
+# Wake up the first session
+$node->safe_psql('postgres', 
+	"SELECT injection_points_wakeup('pg-resize-shared-buffers-flag-set')");
+
+# The pg_resize_shared_buffers() in session1 should now complete successfully
+# We can't easily capture the return value from query_until, but we can
+# verify the session completes without error and the resize actually happened
+$session1->quit;
+
+# Detach injection point
+$node->safe_psql('postgres', 
+	"SELECT injection_points_detach('pg-resize-shared-buffers-flag-set')");
+
+done_testing();
\ No newline at end of file
-- 
2.34.1

#128Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#127)
Re: Changing shared_buffers without restart

As I've mentioned in our off-list communication, I'm working on the new
design and was planning to post some intermediate results in a couple of
weeks. Thus I'm surprised that instead of aligning on plans you've
decided to post your own version earlier. It most certainly doesn't make
things easier for me, so what's your plan anyway? Are you trying to
hijack the thread with your own patches? It doesn't strike me as a
particularly constructive thing to do.

#129Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Dmitry Dolgov (#128)
Re: Changing shared_buffers without restart

Hi Dmitry,

On Tue, Oct 14, 2025 at 2:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

As I've mentioned in our off-list communication, I'm working on the new
design and was planning to post some intermediate results in a couple of
weeks. Thus I'm surprised that instead of aligning on plans you've
decided to post your own version earlier. It most certainly doesn't make
things easier for me, so what's your plan anyway? Are you trying to
hijack the thread with your own patches? It doesn't strike me as a
particularly constructive thing to do.

I am sorry if you have felt that way. That wasn't the intention.
Please allow me to explain ...

Tomas and Andres have pointed out some serious faults in the patchset
posted at [2], including problems in my patches. There are many of
those. If we work in parallel we can make good progress. So I
continued working based on your last patchset [2], which was posted
more than 3 months ago. Knowing that you are working on a design, I
tried not to touch the synchronization and UI part and yet find
solutions to some of the open problems (my patchset in [3] is a recent
example). As I mentioned in my email at [1], every open question I
tried to solve next was blocked because of a single problem, which I
have described in my previous email - a problem in synchronization in
the patchset at [2].

Instead of just doing nothing, I thought I would try to implement the
UI and synchronization that I had in mind. Once I implemented it and
saw that it could address a few serious concerns raised by Andres, I
thought I would share it with hackers to get some early feedback.
Early feedback from people like Andres and Tomas is important to avoid
going down the wrong path (and wasting time). Is there something wrong
with that? BTW, this idea isn't new and it's certainly not only mine.
It's a combination of an implementation shared by Thomas Munro [4] and
an implementation I had shared with you off-list on 30th January 2025.
I never saw any comments from you on the specific changes in those
implementations, nor was anything from those patchsets absorbed into
your patchsets.

If I would have posted my alternate solution in January itself, that
might have been considered hijacking (that's a serious accusation,
btw). But instead I worked with your patches, improving them as long
as I could. Even the patchset I shared is still
on top of your patchset in [2]/messages/by-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2.

I don't know your solution. But if it's similar to my proposal, we are
in agreement and can work further in parallel on subproblems. If it's
different, let's discuss pros and cons of both - maybe there is some
value in letting those evolve parallely and let the community choose
the best, or choose best of both solutions giving rise to a new
solution. My patchset might give you solutions/code for the problems
you are trying to solve. It has tests which you can adapt to your
solution. Many exciting possibilities lie ahead with multiple working
solutions. Knowing nothing about the solution you are attempting, it's
hard to know which of these apply and help you.

[1]: /messages/by-id/CAExHW5sOu8+9h6t7jsA5jVcQ--N-LCtjkPnCw+rpoN0ovT6PHg@mail.gmail.com
[2]: /messages/by-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2
[3]: /messages/by-id/CAExHW5vB8sAmDtkEN5dcYYeBok3D8eAzMFCOH1k+krxht1yFjA@mail.gmail.com
[4]: /messages/by-id/CA+hUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat

#130Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Ashutosh Bapat (#129)
Re: Changing shared_buffers without restart

On Thu, Oct 16, 2025 at 09:55:05PM +0530, Ashutosh Bapat wrote:

BTW, this idea isn't new and it's certainly not only mine.
It's a combination of an implementation shared by Thomas Munro [4] and
an implementation I had shared with you off-list on 30th January 2025.
I never saw any comments from you on the specific changes in those
implementations, nor was anything from those patchsets absorbed into
your patchsets.

Well, this is simply not true. We had an extensive discussion for a long
time off-list and even a few video calls to talk through various design
options and agree on next steps.

I don't know your solution. But if it's similar to my proposal, we are
in agreement and can work further in parallel on subproblems. If it's
different, let's discuss pros and cons of both - maybe there is some
value in letting those evolve in parallel and letting the community choose
the best, or choosing the best of both solutions, giving rise to a new
solution. My patchset might give you solutions/code for the problems
you are trying to solve. It has tests which you can adapt to your
solution. Many exciting possibilities lie ahead with multiple working
solutions. Knowing nothing about the solution you are attempting, it's
hard to know which of these apply and help you.

I've shared many times on- and off-list the general directions I'm
working in and even the expected timeline, so it's strange to state you
don't know it.

In the end you're free to do whatever you want, fortunately it's open
source. But posting an alternative patch series and "let the community
choose" does sound like hijacking to me, and a direct way to split and
reduce already scarce review attention.

#131Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Ashutosh Bapat (#127)
5 attachment(s)
Re: Changing shared_buffers without restart

Hi,
PFA new patchset with some TODOs from previous email addressed:

On Mon, Oct 13, 2025 at 9:28 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

1. New backends join while the synchronization is going on.

Done. Explained the solution below in detail.

An existing backend exiting.

Not tested specifically, but should work.

2. Failure or crash in the backend which is executing pg_resize_buffer_pool()

still a TODO

3. Fix crashes in the tests.

core regression passes, pg_buffercache regression tests pass and the
tests for buffer resizing pass most of the time. So far I have seen
three issues:
1. An assertion from the AIO worker - which happened only once and I
couldn't reproduce again. Need to study the interaction of the AIO worker
with buffer resizing.
2. Checkpointer crashes - which is one of the TODOs listed below.
3. A shared memory id related failure, which I don't understand but which
happens more frequently than the first one. Need to look
into that.

go through Tomas's detailed comments and address those
which still apply.

Still a TODO. But since many of those patches have been revised heavily, I
think many of the comments may have been addressed, and some may not apply
anymore.

And the patches are still WIP, with many TODOs. But I wanted to get some feedback on the proposed UI and synchronization

This is still a request.

Patches 0001 to 0016 are the same as the previous patchset. I haven't
touched them in case someone would like to see an incremental change.
However, it's getting unwieldy at this point, so I will squash
relevant patches together and provide a patchset with fewer patches
next.

I have squashed the patches into 3 so that it's easy to review, read
and work with those patches. The work is still WIP and there are many
TODOs in the patches.

Patch 0001: SQL interface to read contents of buffer lookup table. It
was there in the previous patchset as 0001 but in this patchset I have
moved the SQL function to the pg_buffercache module and renamed it
accordingly. I added this change because I found it useful to debug
issues I found while testing buffer resizing patches. The issues were
related to page->buffer mappings which existed in the buffer lookup
table but were not present in the buffer descriptor array or buffer
blocks. pg_buffercache, which traverses just the buffer descriptor
array, isn't enough. Even without the resizing functionality this will
help us catch situations where the buffer descriptor array and the buffer
lookup table go out of sync. I plan to keep it in this patchset as a
debugging tool. If other developers feel that it could be useful, I
will propose it in a separate thread.

Patch 0002: This is a single patch squashing all patches (0005, 0006,
0007, 0008, 0009 and 0010) related to shared memory management and
address space reservation together. This patch allows the creation of
multiple shared memory segments and also lays them out so as to make
those resizable. The actual code to resize the segments is in the next
patch. The APIs used for memory management and address space
reservation are described later. Prominent changes from the previous
patches are:
1. modifies CalculateShmemSize() so that it can work with multiple
shared memory segments.
2. It also combines AnonymousMapping and ShmemSegment structures
together as suggested by Tomas upthread. The merger is still going on.
There are some old comments or variable names referring to memory
mapping when they should be mentioning shared memory segments. I will
work on that when I start polishing this patch.
3. The GUC to specify the maximum size of the buffer pool has been renamed and
moved to the next patch which deals with actual resizing.
4. Changes to process config reload in AIO workers are removed. Those
are not needed after 55b454d0e14084c841a034073abbf1a0ea937a45.

Patch 0003: Implements the UI and synchronization described in the
previous email [1] with additional improvements to support a new
backend joining while resizing is in progress. This patch squashes
other patches 0002 - 0004 and 0011 onward patches from the previous
patchset, but it also gets rid of a lot of code related to the old
synchronization method and the old UI. The code related to resizing
including implementation of pg_resize_shared_buffers() is moved to
storage/buffer/buf_resize.c, a new file. There is no change to the UI.
The buffer resizing still works as described in the previous
email.

SHOW shared_buffers; -- default
shared_buffers
----------------
128MB
(1 row)

ALTER SYSTEM SET shared_buffers = '64MB';
SELECT pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)

SHOW shared_buffers;
shared_buffers
-----------------------
128MB (pending: 64MB)
(1 row)

SELECT pg_resize_shared_buffers();
pg_resize_shared_buffers
--------------------------
t
(1 row)

SHOW shared_buffers;
shared_buffers
----------------
64MB
(1 row)

ALTER SYSTEM SET shared_buffers = '256MB';
SELECT pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)

SHOW shared_buffers;
shared_buffers
-----------------------
64MB (pending: 256MB)
(1 row)

SELECT pg_resize_shared_buffers();
pg_resize_shared_buffers
--------------------------
t
(1 row)

SHOW shared_buffers;
shared_buffers
----------------
256MB
(1 row)

The implementation uses a similar strategy as described in the
previous email with changes described below.

A new backend inherits the address space of shared memory segments and
the local variable NBuffers through Postmaster. These are changed when
resizing the buffer pool. And the same changes need to be applied to
the Postmaster so that a new backend inherits them. Since Postmaster
is not part of the ProcSignalBarrier mechanism, the coordinator has to
send signals to the Postmaster separately. This has the following
drawbacks:
1. Additional code to signal the Postmaster.
2. The coordinator has to wait for the Postmaster to apply the changes
separately, thus adding extra delays.
3. Platforms which use fork() + exec() will add more complexity to
transfer the state to a new child.
4. If the Postmaster is signaled after sending a barrier to other
backends, the newly joined backend will miss the state update as well
as the barrier. If the Postmaster is signaled before sending a barrier
to other backends, a newly joining backend will receive the barrier as
well as the state update from the Postmaster. This means the barrier handling
code is required to be idempotent. This will make the barrier handling
code more complex and also constrained.

Instead, the approach taken by Thomas Munro in [2] does not require
updating the address space. It uses shared memory variables instead of
process local memory variables to save the state of the shared buffer
pool. This patchset uses a similar approach and
1. avoids involving the Postmaster in the resizing process, and
2. additionally makes the barrier handling code very thin.

Shared Memory and address space management
========================================
An fd is created using memfd_create() to manage the size of the shared
memory segment using ftruncate() and fallocate(). That fd is passed to
mmap(), which reserves the maximum required address space and maps the
anonymous file (and the backing memory) in that address space. mmap()
uses MAP_NORESERVE so as not to allocate memory against the mapping. The
size of the anonymous file controls the amount of memory allocated.
For the main shared memory segment, the size of the reserved space is
the same as the amount of memory required. But for shared buffer pool
related segments the size of the reserved space is decided by the GUC
max_shared_buffers (mentioned in the previous email and quoted below).
When resizing shared buffers, only the anonymous file is resized, not
the address space. I tested this protocol with an attached small
program (mfdtruncate.c). Sharing it in case somebody finds it useful.
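
To make the protocol concrete, here is a minimal, self-contained sketch of the
same idea (this is not the attached mfdtruncate.c and not code from the
patchset; the names and sizes are purely illustrative): reserve the maximum
address space once with mmap(MAP_NORESERVE) over a memfd, then grow or shrink
only the backing anonymous file with ftruncate().

/*
 * Illustrative sketch only: reserve the address space once, resize the
 * backing file later.  Not the attached mfdtruncate.c, not code from the
 * patchset.  Linux only.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
	size_t	reserved = (size_t) 1024 * 1024 * 1024; /* max_shared_buffers worth */
	size_t	active = (size_t) 128 * 1024 * 1024;    /* current shared_buffers worth */
	int		fd;
	char   *base;

	/* Anonymous file; its size controls how much memory is actually allocated. */
	fd = memfd_create("sketch_shared_buffers", 0);
	if (fd < 0 || ftruncate(fd, (off_t) active) < 0)
		return 1;

	/*
	 * Reserve the maximum address space once.  MAP_NORESERVE avoids charging
	 * memory for the unused part; touching pages beyond the current file size
	 * raises SIGBUS, so callers must stay within the active range.
	 */
	base = mmap(NULL, reserved, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_NORESERVE, fd, 0);
	if (base == MAP_FAILED)
		return 1;

	memset(base, 0, active);	/* the active range is usable */

	/* Expand: grow the file only; the mapping and all addresses stay put. */
	active = (size_t) 256 * 1024 * 1024;
	if (ftruncate(fd, (off_t) active) < 0)
		return 1;
	memset(base, 0, active);	/* the newly allocated range is usable too */

	/* Shrink: truncate the file; the freed range must no longer be touched. */
	if (ftruncate(fd, (off_t) 64 * 1024 * 1024) < 0)
		return 1;

	printf("resized twice, base stays at %p\n", (void *) base);
	munmap(base, reserved);
	close(fd);
	return 0;
}

This matches the point above: only the anonymous file changes size during a
resize, so the reserved mapping, and therefore every pointer into it, stays
stable.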

Saving shared buffer pool sizes in the shared memory
=========================================
When resizing, we need to track two ranges of buffers: 1. active
buffers, which is the range of buffers from which new allocations
happen at a given time, and 2. valid buffers, which is the range of
buffers which are valid at a given time. When shrinking, the active
range is set to the new size while the valid range remains the same as
the old size till all the buffers outside the new size are evicted.
When expanding, the valid and active ranges are both changed to the
new size after memory is resized and the expanded data structures are
initialized. The current global variable NBuffers is insufficient to track
these two numbers.

Instead we have a new member StrategyControl::activeNBuffers which
tracks the active buffer range. The shared memory structure
controlling the resizing operation (ShmemCtrl) has a member
currentNBuffers which gives the range of valid shared
buffers at a given point in time. (I am planning to merge ShmemCtrl
and StrategyControl, so that we have all the metadata about shared
buffers in one place in the shared memory). These two numbers are
saved in the shared memory for the reasons explained below and replace
current NBuffers. They are modified by the coordinator as the resizing
progresses. Some usages of NBuffers are replaced by one of the two
variables as appropriate but more work is required.
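
To illustrate how the two ranges interact, here is a self-contained sketch.
The field names mirror activeNBuffers and currentNBuffers from the description
above; the helper functions and the exact order of updates are only my reading
of the proposal, not code from the patchset.

/* Sketch of the "allocation" (active) vs. "valid" (current) buffer ranges. */
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
	int		currentNBuffers;	/* valid range: buffers that may still hold data */
	int		activeNBuffers;		/* allocation range: victims are picked from here */
} SketchCtl;

static bool
may_allocate(const SketchCtl *ctl, int buf_id)
{
	return buf_id < ctl->activeNBuffers;
}

static bool
is_valid(const SketchCtl *ctl, int buf_id)
{
	return buf_id < ctl->currentNBuffers;
}

int
main(void)
{
	SketchCtl	ctl = {.currentNBuffers = 16384, .activeNBuffers = 16384};

	/* Shrinking to 8192: stop handing out victims from the tail first ... */
	ctl.activeNBuffers = 8192;
	printf("buffer 10000 while shrinking: allocatable=%d valid=%d\n",
		   may_allocate(&ctl, 10000), is_valid(&ctl, 10000));

	/*
	 * ... and only after the tail is evicted and memory is shrunk does the
	 * valid range follow.
	 */
	ctl.currentNBuffers = 8192;

	/*
	 * Expanding to 32768: memory is grown and initialized first, then both
	 * ranges are raised to the new size.
	 */
	ctl.currentNBuffers = 32768;
	ctl.activeNBuffers = 32768;
	printf("buffer 20000 after expanding: allocatable=%d valid=%d\n",
		   may_allocate(&ctl, 20000), is_valid(&ctl, 20000));
	return 0;
}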

Next I will be working on:
1. Background writer synchronization
2. Checkpoint synchronization
3. Make all the shared buffer pool structures, except buffer blocks,
static and maximally allocated as suggested by Andres earlier. [3]
4. Replace NBuffers usages as explained above
5. Merge ShmemCtrl and StrategyControl as explained above
6. Handle failures in resizing
7. There have been concerns raised earlier that anonymous file backed
memory is not dumped with core. I am thinking of not using an
anonymous file for the main memory segment so that it gets dumped with
core. But shared buffers still will not be dumped. However, I am skeptical
as to whether we need GBs (say) of shared buffers being dumped along
with core or should we leave that choice to users.

[1]: /messages/by-id/CAExHW5sOu8+9h6t7jsA5jVcQ--N-LCtjkPnCw+rpoN0ovT6PHg@mail.gmail.com
[2]: /messages/by-id/CA+hUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g@mail.gmail.com
[3]: /messages/by-id/qltuzcdxapofdtb5mrd4em3bzu2qiwhp3cdwdsosmn7rhrtn4u@yaogvphfwc4h

--
Best Wishes,
Ashutosh Bapat

Attachments:

0001-Add-a-view-to-read-contents-of-shared-buffe-20251114.patchtext/x-patch; charset=US-ASCII; name=0001-Add-a-view-to-read-contents-of-shared-buffe-20251114.patchDownload
From a24f17114aa9119dbf899166128d48aaf4106ca7 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 25 Aug 2025 19:23:50 +0530
Subject: [PATCH 1/4] Add a view to read contents of shared buffer lookup table

The view exposes the contents of the shared buffer lookup table for
debugging, testing and investigation.

This helped me in debugging issues where the buffer descriptor array and
buffer lookup table were out of sync; either the buffer lookup table had
a mapping page->buffer which wasn't present in the buffer descriptor
array or a page in the buffer descriptor array didn't have corresponding
entry in the buffer lookup table. pg_buffercache doesn't help with those
kind of issues. Also doing that under the debugger in very painful.

I intend to keep this patch while the rest of the code matures. If it is
found useful as a debugging tool, we may consider make it committable
and commit it.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
---
 .../expected/pg_buffercache.out               | 39 ++++++++
 .../pg_buffercache--1.5--1.6.sql              | 24 +++++
 contrib/pg_buffercache/pg_buffercache_pages.c | 18 ++++
 contrib/pg_buffercache/sql/pg_buffercache.sql | 20 +++++
 doc/src/sgml/system-views.sgml                | 89 +++++++++++++++++++
 src/backend/storage/buffer/buf_table.c        | 58 ++++++++++++
 src/include/storage/buf_internals.h           |  2 +
 7 files changed, 250 insertions(+)

diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out
index 9a9216dc7b1..2f27bf34637 100644
--- a/contrib/pg_buffercache/expected/pg_buffercache.out
+++ b/contrib/pg_buffercache/expected/pg_buffercache.out
@@ -23,6 +23,26 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0;
  t
 (1 row)
 
+-- Test the buffer lookup table function and count is <= shared_buffers
+select count(*) <= (select setting::bigint
+                    from pg_settings
+                    where name = 'shared_buffers')
+from pg_buffercache_lookup_table_entries();
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Check that pg_buffercache_lookup_table view works and count is <= shared_buffers
+select count(*) <= (select setting::bigint
+                    from pg_settings
+                    where name = 'shared_buffers')
+from pg_buffercache_lookup_table;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Check that the functions / views can't be accessed by default. To avoid
 -- having to create a dedicated user, use the pg_database_owner pseudo-role.
 SET ROLE pg_database_owner;
@@ -34,6 +54,10 @@ SELECT * FROM pg_buffercache_summary();
 ERROR:  permission denied for function pg_buffercache_summary
 SELECT * FROM pg_buffercache_usage_counts();
 ERROR:  permission denied for function pg_buffercache_usage_counts
+SELECT * FROM pg_buffercache_lookup_table_entries();
+ERROR:  permission denied for function pg_buffercache_lookup_table_entries
+SELECT * FROM pg_buffercache_lookup_table;
+ERROR:  permission denied for view pg_buffercache_lookup_table
 RESET role;
 -- Check that pg_monitor is allowed to query view / function
 SET ROLE pg_monitor;
@@ -55,6 +79,21 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
  t
 (1 row)
 
+RESET role;
+-- Check that pg_read_all_stats is allowed to query buffer lookup table
+SET ROLE pg_read_all_stats;
+SELECT count(*) >= 0 FROM pg_buffercache_lookup_table_entries();
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT count(*) >= 0 FROM pg_buffercache_lookup_table;
+ ?column? 
+----------
+ t
+(1 row)
+
 RESET role;
 ------
 ---- Test pg_buffercache_evict* functions
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
index 458f054a691..9bf58567878 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -44,3 +44,27 @@ CREATE FUNCTION pg_buffercache_evict_all(
     OUT buffers_skipped int4)
 AS 'MODULE_PATHNAME', 'pg_buffercache_evict_all'
 LANGUAGE C PARALLEL SAFE VOLATILE;
+
+-- Add the buffer lookup table function
+CREATE FUNCTION pg_buffercache_lookup_table_entries(
+    OUT tablespace oid,
+    OUT database oid,
+    OUT relfilenode oid,
+    OUT forknum int2,
+    OUT blocknum int8,
+    OUT bufferid int4)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_buffercache_lookup_table_entries'
+LANGUAGE C PARALLEL SAFE VOLATILE;
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache_lookup_table AS
+    SELECT * FROM pg_buffercache_lookup_table_entries();
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_lookup_table_entries() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_lookup_table FROM PUBLIC;
+
+-- Grant access to monitoring role.
+GRANT EXECUTE ON FUNCTION pg_buffercache_lookup_table_entries() TO pg_read_all_stats;
+GRANT SELECT ON pg_buffercache_lookup_table TO pg_read_all_stats;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index ab790533ff6..fe9af45febe 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -16,6 +16,7 @@
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "utils/rel.h"
+#include "utils/tuplestore.h"
 
 
 #define NUM_BUFFERCACHE_PAGES_MIN_ELEM	8
@@ -100,6 +101,7 @@ PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
 PG_FUNCTION_INFO_V1(pg_buffercache_evict);
 PG_FUNCTION_INFO_V1(pg_buffercache_evict_relation);
 PG_FUNCTION_INFO_V1(pg_buffercache_evict_all);
+PG_FUNCTION_INFO_V1(pg_buffercache_lookup_table_entries);
 
 
 /* Only need to touch memory once per backend process lifetime */
@@ -776,3 +778,19 @@ pg_buffercache_evict_all(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+/*
+ * Return lookup table content as a set of records.
+ */
+Datum
+pg_buffercache_lookup_table_entries(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	/* Fill the tuplestore */
+	BufTableGetContents(rsinfo->setResult, rsinfo->setDesc);
+
+	return (Datum) 0;
+}
diff --git a/contrib/pg_buffercache/sql/pg_buffercache.sql b/contrib/pg_buffercache/sql/pg_buffercache.sql
index 47cca1907c7..569b28aebb9 100644
--- a/contrib/pg_buffercache/sql/pg_buffercache.sql
+++ b/contrib/pg_buffercache/sql/pg_buffercache.sql
@@ -12,6 +12,18 @@ from pg_buffercache_summary();
 
 SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0;
 
+-- Test the buffer lookup table function and count is <= shared_buffers
+select count(*) <= (select setting::bigint
+                    from pg_settings
+                    where name = 'shared_buffers')
+from pg_buffercache_lookup_table_entries();
+
+-- Check that pg_buffercache_lookup_table view works and count is <= shared_buffers
+select count(*) <= (select setting::bigint
+                    from pg_settings
+                    where name = 'shared_buffers')
+from pg_buffercache_lookup_table;
+
 -- Check that the functions / views can't be accessed by default. To avoid
 -- having to create a dedicated user, use the pg_database_owner pseudo-role.
 SET ROLE pg_database_owner;
@@ -19,6 +31,8 @@ SELECT * FROM pg_buffercache;
 SELECT * FROM pg_buffercache_pages() AS p (wrong int);
 SELECT * FROM pg_buffercache_summary();
 SELECT * FROM pg_buffercache_usage_counts();
+SELECT * FROM pg_buffercache_lookup_table_entries();
+SELECT * FROM pg_buffercache_lookup_table;
 RESET role;
 
 -- Check that pg_monitor is allowed to query view / function
@@ -28,6 +42,12 @@ SELECT buffers_used + buffers_unused > 0 FROM pg_buffercache_summary();
 SELECT count(*) > 0 FROM pg_buffercache_usage_counts();
 RESET role;
 
+-- Check that pg_read_all_stats is allowed to query buffer lookup table
+SET ROLE pg_read_all_stats;
+SELECT count(*) >= 0 FROM pg_buffercache_lookup_table_entries();
+SELECT count(*) >= 0 FROM pg_buffercache_lookup_table;
+RESET role;
+
 
 ------
 ---- Test pg_buffercache_evict* functions
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 7971498fe75..8f3e2741051 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -71,6 +71,11 @@
       <entry>backend memory contexts</entry>
      </row>
 
+     <row>
+      <entry><link linkend="view-pg-buffer-lookup-table"><structname>pg_buffer_lookup_table</structname></link></entry>
+      <entry>shared buffer lookup table</entry>
+     </row>
+
      <row>
       <entry><link linkend="view-pg-config"><structname>pg_config</structname></link></entry>
       <entry>compile-time configuration parameters</entry>
@@ -901,6 +906,90 @@ AND c1.path[c2.level] = c2.path[c2.level];
   </para>
  </sect1>
 
+ <sect1 id="view-pg-buffer-lookup-table">
+  <title><structname>pg_buffer_lookup_table</structname></title>
+  <indexterm>
+   <primary>pg_buffer_lookup_table</primary>
+  </indexterm>
+  <para>
+   The <structname>pg_buffer_lookup_table</structname> view exposes the current
+   contents of the shared buffer lookup table. Each row represents an entry in
+   the lookup table mapping a relation page to the ID of the buffer in which it
+   is cached. The shared buffer lookup table is locked for a short duration while
+   reading so as to ensure consistency. This may affect performance if this view
+   is queried very frequently.
+  </para>
+  <table id="pg-buffer-lookup-table-view" xreflabel="pg_buffer_lookup_table">
+   <title><structname>pg_buffer_lookup_table</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>tablespace</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the tablespace containing the relation
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>database</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the database containing the relation (zero for shared relations)
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>relfilenode</structfield> <type>oid</type>
+      </para>
+      <para>
+       relfilenode identifying the relation
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>forknum</structfield> <type>int2</type>
+      </para>
+      <para>
+       Fork number within the relation (see <xref linkend="storage-file-layout"/>)
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>blocknum</structfield> <type>int8</type>
+      </para>
+      <para>
+       Block number within the relation
+      </para></entry>
+     </row>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>bufferid</structfield> <type>int4</type>
+      </para>
+      <para>
+       ID of the buffer caching the page 
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+  <para>
+   Access to this view is restricted to members of the
+   <literal>pg_read_all_stats</literal> role by default.
+  </para>
+ </sect1>
+
  <sect1 id="view-pg-config">
   <title><structname>pg_config</structname></title>
 
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 9d256559bab..f0c39ec2822 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -21,7 +21,12 @@
  */
 #include "postgres.h"
 
+#include "fmgr.h"
+#include "funcapi.h"
 #include "storage/buf_internals.h"
+#include "storage/lwlock.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
 
 /* entry for buffer lookup hashtable */
 typedef struct
@@ -159,3 +164,56 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 	if (!result)				/* shouldn't happen */
 		elog(ERROR, "shared buffer hash table corrupted");
 }
+
+/*
+ * BufTableGetContents
+ *		Fill the given tuplestore with contents of the shared buffer lookup table
+ *
+ * This function is used by pg_buffercache extension to expose buffer lookup
+ * table contents via SQL. The caller is responsible for setting up the
+ * tuplestore and result set info.
+ */
+void
+BufTableGetContents(Tuplestorestate *tupstore, TupleDesc tupdesc)
+{
+/* Expected number of attributes of the buffer lookup table entry. */
+#define BUFTABLE_CONTENTS_COLS 6
+
+	HASH_SEQ_STATUS hstat;
+	BufferLookupEnt *ent;
+	Datum		values[BUFTABLE_CONTENTS_COLS];
+	bool		nulls[BUFTABLE_CONTENTS_COLS];
+	int			i;
+
+	memset(nulls, 0, sizeof(nulls));
+
+	Assert(tupdesc->natts == BUFTABLE_CONTENTS_COLS);
+
+	/*
+	 * Lock all buffer mapping partitions to ensure a consistent view of the
+	 * hash table during the scan. Must grab LWLocks in partition-number order
+	 * to avoid LWLock deadlock.
+	 */
+	for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
+		LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);
+
+	hash_seq_init(&hstat, SharedBufHash);
+	while ((ent = (BufferLookupEnt *) hash_seq_search(&hstat)) != NULL)
+	{
+		values[0] = ObjectIdGetDatum(ent->key.spcOid);
+		values[1] = ObjectIdGetDatum(ent->key.dbOid);
+		values[2] = ObjectIdGetDatum(ent->key.relNumber);
+		values[3] = ObjectIdGetDatum(ent->key.forkNum);
+		values[4] = Int64GetDatum(ent->key.blockNum);
+		values[5] = Int32GetDatum(ent->id);
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	/*
+	 * Release all buffer mapping partition locks in the reverse order so as
+	 * to avoid LWLock deadlock.
+	 */
+	for (i = NUM_BUFFER_PARTITIONS - 1; i >= 0; i--)
+		LWLockRelease(BufMappingPartitionLockByIndex(i));
+}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 5400c56a965..519692702a0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -28,6 +28,7 @@
 #include "storage/spin.h"
 #include "utils/relcache.h"
 #include "utils/resowner.h"
+#include "utils/tuplestore.h"
 
 /*
  * Buffer state is a single 32-bit variable where following data is combined.
@@ -520,6 +521,7 @@ extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
 extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableGetContents(Tuplestorestate *tupstore, TupleDesc tupdesc);
 
 /* localbuf.c */
 extern bool PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount);

base-commit: df53fa1c1ebf9bb3e8c17217f7cc1435107067fb
-- 
2.34.1

0004-WIP-test-shared-buffers-resizing-and-checkp-20251114.patch
From 17b83eb9d1b5a825e1e2bfca9d360a738213bf01 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Wed, 1 Oct 2025 09:38:19 +0530
Subject: [PATCH 4/4] WIP test shared buffers resizing and checkpoint

A new test triggers an injection point in BufferSync() after it has
collected the buffers to be flushed. Simultaneously it starts shrinking
the buffer pool. The expectation is that the checkpointer would crash
accessing a buffer (descriptor) outside the new range of shared buffers.
But that does not happen because of a bug in the synchronization: the
checkpointer does not reload the configuration while a checkpoint is in
progress, so it never sees the new value. When the resizing is triggered
by the postmaster, the checkpointer receives the proc signal barrier but
does not process it, so it doesn't enter the barrier mechanism and
doesn't alter its address maps or memory sizes. Hence the test does not
crash. But of course it also means the checkpointer won't use the
correct number of buffers the next time it performs a checkpoint.

The test was at least useful to detect this anomaly. Once we fix the
synchronization issue we should see the crash, and then fix the crash.

Author: Ashutosh Bapat

Notes to reviewers
------------------

1. pg_buffercache used a query on pg_settings to fetch the number of
   buffers. That doesn't work anymore because of the change in the
   SHOW shared_buffers output. Modified the test to convert the setting
   value to the number of shared buffers, save it in a variable and use
   that variable in the queries which need the number of shared buffers.
   We could instead fix ShowGUCOption() to pass the use_units flag to
   the show_hook and let it output the number of shared buffers instead,
   but that seems a larger change. There are no other GUCs whose
   show_hook outputs their value with units, so this local fix might be
   better.
---
 .../expected/pg_buffercache.out               |  19 ++-
 contrib/pg_buffercache/sql/pg_buffercache.sql |  19 ++-
 src/backend/storage/buffer/bufmgr.c           |   4 +
 src/test/buffermgr/meson.build                |   1 +
 .../t/002_checkpoint_buffer_resize.pl         | 111 ++++++++++++++++++
 5 files changed, 130 insertions(+), 24 deletions(-)
 create mode 100644 src/test/buffermgr/t/002_checkpoint_buffer_resize.pl

diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out
index 2f27bf34637..632b12abbf8 100644
--- a/contrib/pg_buffercache/expected/pg_buffercache.out
+++ b/contrib/pg_buffercache/expected/pg_buffercache.out
@@ -1,8 +1,9 @@
 CREATE EXTENSION pg_buffercache;
-select count(*) = (select setting::bigint
-                   from pg_settings
-                   where name = 'shared_buffers')
-from pg_buffercache;
+select pg_size_bytes(setting)/(select setting::bigint from pg_settings where name = 'block_size') AS nbuffers
+        from pg_settings
+        where name = 'shared_buffers'
+\gset
+select count(*) = :nbuffers from pg_buffercache;
  ?column? 
 ----------
  t
@@ -24,20 +25,14 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0;
 (1 row)
 
 -- Test the buffer lookup table function and count is <= shared_buffers
-select count(*) <= (select setting::bigint
-                    from pg_settings
-                    where name = 'shared_buffers')
-from pg_buffercache_lookup_table_entries();
+select count(*) <= :nbuffers from pg_buffercache_lookup_table_entries();
  ?column? 
 ----------
  t
 (1 row)
 
 -- Check that pg_buffercache_lookup_table view works and count is <= shared_buffers
-select count(*) <= (select setting::bigint
-                    from pg_settings
-                    where name = 'shared_buffers')
-from pg_buffercache_lookup_table;
+select count(*) <= :nbuffers from pg_buffercache_lookup_table;
  ?column? 
 ----------
  t
diff --git a/contrib/pg_buffercache/sql/pg_buffercache.sql b/contrib/pg_buffercache/sql/pg_buffercache.sql
index 569b28aebb9..11fe85ceb3b 100644
--- a/contrib/pg_buffercache/sql/pg_buffercache.sql
+++ b/contrib/pg_buffercache/sql/pg_buffercache.sql
@@ -1,9 +1,10 @@
 CREATE EXTENSION pg_buffercache;
 
-select count(*) = (select setting::bigint
-                   from pg_settings
-                   where name = 'shared_buffers')
-from pg_buffercache;
+select pg_size_bytes(setting)/(select setting::bigint from pg_settings where name = 'block_size') AS nbuffers
+        from pg_settings
+        where name = 'shared_buffers'
+\gset
+select count(*) = :nbuffers from pg_buffercache;
 
 select buffers_used + buffers_unused > 0,
         buffers_dirty <= buffers_used,
@@ -13,16 +14,10 @@ from pg_buffercache_summary();
 SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0;
 
 -- Test the buffer lookup table function and count is <= shared_buffers
-select count(*) <= (select setting::bigint
-                    from pg_settings
-                    where name = 'shared_buffers')
-from pg_buffercache_lookup_table_entries();
+select count(*) <= :nbuffers from pg_buffercache_lookup_table_entries();
 
 -- Check that pg_buffercache_lookup_table view works and count is <= shared_buffers
-select count(*) <= (select setting::bigint
-                    from pg_settings
-                    where name = 'shared_buffers')
-from pg_buffercache_lookup_table;
+select count(*) <= :nbuffers from pg_buffercache_lookup_table;
 
 -- Check that the functions / views can't be accessed by default. To avoid
 -- having to create a dedicated user, use the pg_database_owner pseudo-role.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6c8f8552a4c..f489ae2932f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -67,6 +67,7 @@
 #include "utils/rel.h"
 #include "utils/resowner.h"
 #include "utils/timestamp.h"
+#include "utils/injection_point.h"
 
 
 /* Note: these two macros only work on shared buffers, not local ones! */
@@ -3416,6 +3417,9 @@ BufferSync(int flags)
 			ProcessProcSignalBarrier();
 	}
 
+	/* Injection point after scanning all buffers for dirty pages */
+	INJECTION_POINT("buffer-sync-dirty-buffer-scan", NULL);
+
 	if (num_to_scan == 0)
 		return;					/* nothing to do */
 
diff --git a/src/test/buffermgr/meson.build b/src/test/buffermgr/meson.build
index c24bff721e6..f33feb64a06 100644
--- a/src/test/buffermgr/meson.build
+++ b/src/test/buffermgr/meson.build
@@ -16,6 +16,7 @@ tests += {
     },
     'tests': [
       't/001_resize_buffer.pl',
+      't/002_checkpoint_buffer_resize.pl',
       't/003_parallel_resize_buffer.pl',
       't/004_client_join_buffer_resize.pl',
     ],
diff --git a/src/test/buffermgr/t/002_checkpoint_buffer_resize.pl b/src/test/buffermgr/t/002_checkpoint_buffer_resize.pl
new file mode 100644
index 00000000000..9ab615b6557
--- /dev/null
+++ b/src/test/buffermgr/t/002_checkpoint_buffer_resize.pl
@@ -0,0 +1,111 @@
+# Copyright (c) 2025-2025, PostgreSQL Global Development Group
+#
+# Test shared_buffer resizing coordination with checkpoint using injection points
+
+use strict;
+use warnings;
+use IPC::Run;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Skip this test if injection points are not supported
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize cluster with injection points enabled
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('postgresql.conf', 'shared_preload_libraries = injection_points');
+$node->append_conf('postgresql.conf', 'shared_buffers = 256kB');
+# Disable background writer to prevent interference with dirty buffers
+$node->append_conf('postgresql.conf', 'bgwriter_lru_maxpages = 0');
+$node->start;
+
+# Load the injection points extension
+$node->safe_psql('postgres', "CREATE EXTENSION injection_points");
+
+# Create some data to make checkpoint meaningful and ensure many dirty buffers
+$node->safe_psql('postgres', "CREATE TABLE test_data (id int, data text)");
+# Insert enough data to fill more than 16 buffers (each row ~1KB, so 20+ rows per page)
+$node->safe_psql('postgres', "INSERT INTO test_data SELECT i, repeat('x', 1000) FROM generate_series(1, 5000) i");
+
+# Create additional tables to ensure we have plenty of dirty buffers
+$node->safe_psql('postgres', "CREATE TABLE test_data2 AS SELECT * FROM test_data WHERE id <= 2500");
+$node->safe_psql('postgres', "CREATE TABLE test_data3 AS SELECT * FROM test_data WHERE id > 2500");
+
+# Update data to create more dirty buffers
+$node->safe_psql('postgres', "UPDATE test_data SET data = repeat('y', 1000) WHERE id % 3 = 0");
+$node->safe_psql('postgres', "UPDATE test_data2 SET data = repeat('z', 1000) WHERE id % 2 = 0");
+
+# Prepare the new shared_buffers configuration before starting checkpoint
+$node->safe_psql('postgres', "ALTER SYSTEM SET shared_buffers = '128kB'");
+$node->safe_psql('postgres', "SELECT pg_reload_conf()");
+
+# Set up the injection point to make checkpoint wait
+$node->safe_psql('postgres', "SELECT injection_points_attach('buffer-sync-dirty-buffer-scan', 'wait')");
+
+# Start a checkpoint in the background that will trigger the injection point
+my $checkpoint_session = $node->background_psql('postgres');
+$checkpoint_session->query_until(
+	qr/starting_checkpoint/,
+	q(
+		\echo starting_checkpoint
+		CHECKPOINT;
+		\q
+	)
+);
+
+# Wait until checkpointer actually reaches the injection point
+$node->wait_for_event('checkpointer', 'buffer-sync-dirty-buffer-scan');
+
+# Verify checkpoint is waiting by checking if it hasn't completed
+my $checkpoint_running = $node->safe_psql('postgres', 
+	"SELECT COUNT(*) FROM pg_stat_activity WHERE backend_type = 'checkpointer' AND wait_event = 'buffer-sync-dirty-buffer-scan'");
+is($checkpoint_running, '1', 'Checkpoint is waiting at injection point');
+
+# Start the resize operation in the background (don't wait for completion)
+my $resize_session = $node->background_psql('postgres');
+$resize_session->query_until(
+	qr/starting_resize/,
+	q(
+		\echo starting_resize
+		SELECT pg_resize_shared_buffers();
+	)
+);
+
+# Continue the checkpoint and wait for its completion
+my $log_offset = -s $node->logfile;
+$node->safe_psql('postgres', "SELECT injection_points_wakeup('buffer-sync-dirty-buffer-scan')");
+
+# Wait for both checkpoint and resize to complete
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
+
+# Wait for the resize operation to complete using the proper method
+$resize_session->query(q(\echo 'resize_complete'));
+
+pass('Checkpoint and buffer resize both completed after injection point was released');
+
+# Verify the resize actually worked
+is($node->safe_psql('postgres', "SHOW shared_buffers"), '128kB',
+	'Buffer resize completed successfully after checkpoint coordination');
+
+# Cleanup the background session
+$resize_session->quit;
+
+# Clean up the injection point
+$node->safe_psql('postgres', "SELECT injection_points_detach('buffer-sync-dirty-buffer-scan')");
+
+# Verify system remains stable after coordinated operations
+
+# Perform a normal checkpoint to ensure everything is working
+$node->safe_psql('postgres', "CHECKPOINT");
+
+pass('System remains stable after injection point testing');
+
+# Cleanup
+$node->safe_psql('postgres', "DROP TABLE test_data, test_data2, test_data3");
+
+done_testing();
\ No newline at end of file
-- 
2.34.1

0003-Allow-to-resize-shared-memory-without-resta-20251114.patch
From e49b35344a9cbbb7e3229df2e7a2ff0a83290d59 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Tue, 17 Jun 2025 14:16:55 +0200
Subject: [PATCH 3/4] Allow to resize shared memory without restart

shared_buffers is now PGC_SIGHUP instead of PGC_POSTMASTER.  The value
of this GUC is saved in NBuffersPending instead of NBuffers. When the
server starts, the shared memory size is estimated and the memory is
allocated using NBuffersPending.

When the server is running, a new value of the GUC (set using ALTER SYSTEM
... SET shared_buffers = ...; followed by SELECT pg_reload_conf()) does
not come into effect immediately. Instead a function
pg_resize_shared_buffers() is used to resize the buffer pool. The
function uses the current value of the GUC in the backend where it is
executed, and it also coordinates the resizing synchronization across
backends.

SHOW shared_buffers now shows the current size of the shared buffer pool
and also the pending size, if any.

A new GUC max_shared_buffers is introduced to control the maximum value
that shared_buffers can be set to. By default it is 0; when set
explicitly it needs to be higher than shared_buffers, and when left at 0
it assumes the same value as shared_buffers. This GUC determines the
size of the address space reserved for future buffer pool sizes and the
size of the buffer lookup table.

TBD: Describe the protocol used by pg_resize_shared_buffers() to
synchronize buffer resizing operation with other backends.

When shrinking the shared buffer pool, each buffer in the area being
shrunk needs to be flushed if it's dirty, so as not to lose the changes
to that buffer after shrinking. Also, each such buffer needs to be
removed from the buffer mapping table so that backends do not access it
after shrinking.

If a buffer being evicted is pinned, we abort the resizing operation.
There are other alternatives which are not implemented in the current
patches: 1. wait for the pinned buffer to get unpinned, 2. kill the
backend holding the pin or have it cancel its query, or 3. roll back the
operation. Note that options 1 and 2 would require accessing the
pin-related local and shared records, but we currently lack the
infrastructure to do either of those.

So far the buffer pool metadata (NBuffers and the shared memory segment
address space) is saved in process-local heap memory, since it is static
for the life of a server, and is passed to a new backend through the
postmaster. But with the buffer pool being resized while the server is
running, we need the postmaster to update its buffer pool metadata as
the resizing progresses and pass it to new backends. This has a few
complications:

1. Postmaster does not receive ProcSignalBarrier, so we need to signal
   it separately.

2. Postmaster's local state is inherited by the new backend when
   fork()ed, but we would need a more complex implementation to pass it
   to an exec()ed backend.

3. A new backend may receive the updated state from Postmaster and also
   the signal barrier which prompts the same update. Thus the proc
   signal barrier code needs to be idempotent, adding further complexity
   to it.

4. This task takes Postmaster resources away from its core
   functionality.

This can be avoided by the following two changes:
1. Resize the shared memory only in a single backend, without requiring
   any changes to the memory address space.
2. Maintain the buffer pool metadata in shared memory instead of
   process-local memory (see the sketch below). This change may affect
   performance, so we should verify that performance is not degraded.
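
A rough sketch of the shared control area this implies follows. The field
list mirrors what ShmemControlInit() later in this patch initializes; the
actual declaration (and the exact type of the coordinator field) lives in
the headers and may differ:

/* Sketch only; see ShmemControlInit() further down in this patch. */
typedef struct ShmemControl
{
	pg_atomic_uint32 currentNBuffers;	/* pool size all backends agree on */
	pg_atomic_uint32 targetNBuffers;	/* size requested by the coordinator */
	pg_atomic_flag	 resize_in_progress;
	int				 coordinator;		/* backend driving the resize; the
										 * exact type here is a guess */
} ShmemControl;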

TODO: In case the backend executing pg_resize_shared_buffers() exits
before the operation finishes, we need to make sure that the changes
made to the shared memory while resizing are cleaned up properly.

Removing the evicted buffers from buffer ring
=============================================
If the buffer pool has been shrunk, the buffers in the buffer ring may
not be valid anymore. Modify GetBufferFromRing() to check whether the
buffer is still valid before using it. This makes GetBufferFromRing() a
bit more expensive because of the additional boolean condition, and it
masks any bug that introduces an invalid buffer into the ring. The
alternative fix is more complex, as explained below.
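
As an illustration, the check amounts to something like the following
hypothetical helper (not the patch's exact code; shared buffer numbers
are 1-based, so anything above NBuffers is out of range after a shrink):

/* Hypothetical helper: is this ring entry still inside the current pool? */
static inline bool
RingBufferStillValid(Buffer bufnum)
{
	return bufnum != InvalidBuffer && bufnum <= NBuffers;
}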

The strategy object is created in CurrentMemoryContext and is not
reachable through any global structure, and thus not accessible when
processing buffer-resizing barriers. We could modify GetAccessStrategy()
to register the strategy in a global linked list and then arrange to
deregister it once it is no longer in use. Looking at the places which
use GetAccessStrategy(), fixing all of those may be some work.

Author: Ashutosh Bapat
Author: Dmitrii Dolgov
Author of some tests: Palak Chaturvedi <chaturvedipalak1911@gmail.com>
Reviewed-by: Tomas Vondra

More detailed notes follow; we need to see which of them fit in the
commit message and which should be removed.

Reinitializing the strategy control area
========================================
The commit introduces a separate function StrategyReInitialize() instead
of reusing StrategyInitialize(), since some of the things the latter does
are not required in the former. Here's a list of what
StrategyReInitialize() does and how it differs from StrategyInitialize().

1. The StrategyControl pointer needn't be fetched again since it should
   not change, but an Assert has been added to make sure the pointer is
   valid.
2. &StrategyControl->buffer_strategy_lock need not be initialized again.
3. nextVictimBuffer, completePasses and numBufferAllocs are interpreted
   in the context of NBuffers. Now that NBuffers itself has changed,
   their old values no longer make sense; reset them as if the server
   had restarted.

Ability to delay resizing operation
===================================
This commit introduces a flag delay_shmem_resize, which PostgreSQL
backends and workers can use to signal the coordinator to delay the
resizing operation. The background writer sets this flag while it is
scanning buffers.

Background writer operation (needs a rethink)
===========================
The background writer is blocked while the actual resizing is in
progress. It stops a scan in progress when it sees that the resizing has
begun or is about to begin. Once the buffer resizing is finished, before
resuming regular operation, bgwriter resets the information saved so
far. This information is interpreted in the context of NBuffers and
hence does not make sense after a resize, which changes NBuffers.

Buffer lookup table
===================
Right now there is no way to free shared memory. Even if we shrink the
buffer lookup table when shrinking the buffer pool, the unused hash
table entries cannot be freed. When we expand the buffer pool, more
entries can be allocated, but we cannot resize the hash table directory
without rehashing all the entries, and just allocating more entries will
lead to more contention. Hence we set up the buffer lookup table only
once, at the beginning, considering the maximum possible size of the
buffer pool (MaxAvailableMemory). The shared buffer lookup table and
StrategyControl are not resized even if the buffer pool is resized,
hence they are allocated in the main shared memory segment.
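
Roughly, the sizing could look like the sketch below (illustrative only;
MaxNBuffers is the limit introduced by this patch, while
BufTableShmemSize() and NUM_BUFFER_PARTITIONS already exist today):

/* Sketch: size the lookup table once, for the largest pool we may grow to. */
static Size
MaxBufTableShmemSize(void)
{
	/* extra NUM_BUFFER_PARTITIONS entries, as StrategyShmemSize() does today */
	return BufTableShmemSize(MaxNBuffers + NUM_BUFFER_PARTITIONS);
}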

BgWriter refactoring
====================

The way BgBufferSync() is written today, it packs four functionalities:
setting up the buffer sync state, performing the buffer sync, resetting
the buffer sync state when bgwriter_lru_maxpages <= 0, and setting it up
again once bgwriter_lru_maxpages > 0. That makes the code hard to read.
It would be good to divide this function into three or four functions,
each performing one of those tasks, and then pack all the state (the
local variables from that function, converted to static globals by this
patch) into a structure which is passed to these functions. Once that
happens, BgBufferSyncReset() will call one of those functions to reset
the state when the buffer pool is resized.
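
For instance, the state could be packed roughly like this (a sketch only;
the field names follow the static locals BgBufferSync() keeps today, and
the struct name is made up):

typedef struct BgBufferSyncState
{
	bool	saved_info_valid;		/* is the saved scan state usable? */
	int		prev_strategy_buf_id;	/* clock-sweep position at the last call */
	uint32	prev_strategy_passes;
	int		next_to_clean;			/* next buffer bgwriter will examine */
	uint32	next_passes;
	float	smoothed_alloc;			/* moving average of recent allocations */
	float	smoothed_density;		/* moving average of dirty-buffer density */
} BgBufferSyncState;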
---
 contrib/pg_buffercache/pg_buffercache_pages.c |  18 +-
 doc/src/sgml/config.sgml                      |  44 +-
 doc/src/sgml/func/func-admin.sgml             |  57 +++
 src/backend/access/transam/slru.c             |   2 +-
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/bootstrap/bootstrap.c             |   2 +
 src/backend/port/sysv_shmem.c                 | 176 +++++++-
 src/backend/postmaster/checkpointer.c         |  12 +-
 src/backend/postmaster/postmaster.c           |  10 +-
 src/backend/storage/buffer/Makefile           |   3 +-
 src/backend/storage/buffer/buf_init.c         | 279 ++++++++++--
 src/backend/storage/buffer/buf_resize.c       | 399 ++++++++++++++++++
 src/backend/storage/buffer/buf_table.c        |   9 +-
 src/backend/storage/buffer/bufmgr.c           | 160 ++++++-
 src/backend/storage/buffer/freelist.c         | 106 ++++-
 src/backend/storage/buffer/meson.build        |   1 +
 src/backend/storage/ipc/ipci.c                |  13 +-
 src/backend/storage/ipc/procsignal.c          |  55 +++
 src/backend/storage/ipc/shmem.c               |  66 ++-
 src/backend/tcop/postgres.c                   |  11 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/backend/utils/init/globals.c              |   5 +-
 src/backend/utils/init/postinit.c             |  49 +++
 src/backend/utils/misc/guc.c                  |   2 +-
 src/backend/utils/misc/guc_parameters.dat     |  15 +-
 src/include/catalog/pg_proc.dat               |   6 +
 src/include/miscadmin.h                       |   5 +
 src/include/storage/buf_internals.h           |   1 +
 src/include/storage/bufmgr.h                  |  20 +-
 src/include/storage/ipc.h                     |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pg_shmem.h                |  52 ++-
 src/include/storage/pmsignal.h                |   3 +-
 src/include/storage/procsignal.h              |   4 +
 src/include/storage/shmem.h                   |   3 +
 src/include/utils/guc.h                       |   2 +
 src/test/Makefile                             |   2 +-
 src/test/README                               |   3 +
 src/test/buffermgr/Makefile                   |  30 ++
 src/test/buffermgr/README                     |  26 ++
 src/test/buffermgr/buffermgr_test.conf        |  11 +
 src/test/buffermgr/expected/buffer_resize.out | 329 +++++++++++++++
 src/test/buffermgr/meson.build                |  23 +
 src/test/buffermgr/sql/buffer_resize.sql      |  95 +++++
 src/test/buffermgr/t/001_resize_buffer.pl     | 135 ++++++
 .../buffermgr/t/003_parallel_resize_buffer.pl |  71 ++++
 .../t/004_client_join_buffer_resize.pl        | 241 +++++++++++
 src/test/meson.build                          |   1 +
 .../perl/PostgreSQL/Test/BackgroundPsql.pm    |  76 ++++
 src/tools/pgindent/typedefs.list              |   1 +
 50 files changed, 2535 insertions(+), 107 deletions(-)
 create mode 100644 src/backend/storage/buffer/buf_resize.c
 create mode 100644 src/test/buffermgr/Makefile
 create mode 100644 src/test/buffermgr/README
 create mode 100644 src/test/buffermgr/buffermgr_test.conf
 create mode 100644 src/test/buffermgr/expected/buffer_resize.out
 create mode 100644 src/test/buffermgr/meson.build
 create mode 100644 src/test/buffermgr/sql/buffer_resize.sql
 create mode 100644 src/test/buffermgr/t/001_resize_buffer.pl
 create mode 100644 src/test/buffermgr/t/003_parallel_resize_buffer.pl
 create mode 100644 src/test/buffermgr/t/004_client_join_buffer_resize.pl

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index fe9af45febe..e311e3d266c 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -118,6 +118,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 	TupleDesc	tupledesc;
 	TupleDesc	expected_tupledesc;
 	HeapTuple	tuple;
+	int			currentNBuffers = pg_atomic_read_u32(&ShmemCtrl->currentNBuffers);
 
 	if (SRF_IS_FIRSTCALL())
 	{
@@ -174,10 +175,10 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 		/* Allocate NBuffers worth of BufferCachePagesRec records. */
 		fctx->record = (BufferCachePagesRec *)
 			MemoryContextAllocHuge(CurrentMemoryContext,
-								   sizeof(BufferCachePagesRec) * NBuffers);
+								   sizeof(BufferCachePagesRec) * currentNBuffers);
 
 		/* Set max calls and remember the user function context. */
-		funcctx->max_calls = NBuffers;
+		funcctx->max_calls = currentNBuffers; 
 		funcctx->user_fctx = fctx;
 
 		/* Return to original context when allocating transient memory */
@@ -191,13 +192,24 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 		 * snapshot across all buffers, but we do grab the buffer header
 		 * locks, so the information of each buffer is self-consistent.
 		 */
-		for (i = 0; i < NBuffers; i++)
+		for (i = 0; i < currentNBuffers; i++)
 		{
 			BufferDesc *bufHdr;
 			uint32		buf_state;
 
 			CHECK_FOR_INTERRUPTS();
 
+			/*
+			 * TODO: We should just scan the entire buffer descriptor
+			 * array instead of relying on the current buffer pool size. But
+			 * that can only happen if we set up the descriptor array large
+			 * enough at server startup time.
+			 */
+			if (currentNBuffers != pg_atomic_read_u32(&ShmemCtrl->currentNBuffers))
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("number of shared buffers changed during scan of buffer cache")));
+
 			bufHdr = GetBufferDescriptor(i);
 			/* Lock each buffer header before inspecting. */
 			buf_state = LockBufHdr(bufHdr);
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 683f7c36f46..e4cd9b1f555 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1724,7 +1724,6 @@ include_dir 'conf.d'
         that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
         (Non-default values of <symbol>BLCKSZ</symbol> change the minimum
         value.)
-        This parameter can only be set at server start.
        </para>
 
        <para>
@@ -1747,6 +1746,49 @@ include_dir 'conf.d'
         appropriate, so as to leave adequate space for the operating system.
        </para>
 
+       <para>
+        The shared memory consumed by the buffer pool is allocated and
+        initialized according to the value of the GUC at the time the server
+        is started. A new value of the GUC can be loaded while the server is
+        running using <systemitem>SIGHUP</systemitem>, but the buffer pool will
+        not be resized immediately. Use
+        <function>pg_resize_shared_buffers()</function> to dynamically resize
+        the shared buffer pool (see <xref linkend="functions-admin"/> for details).
+        <command>SHOW shared_buffers</command> shows the current number of
+        shared buffers and the pending number, if any. Please note that when
+        this GUC is changed, other GUCs which use its value to set their
+        defaults are not changed; they may still require a server restart to
+        pick up the new value.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-max-shared-buffers" xreflabel="max_shared_buffers">
+      <term><varname>max_shared_buffers</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_shared_buffers</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the upper limit for the <varname>shared_buffers</varname> value.
+        The default value is <literal>0</literal>,
+        which means no explicit limit is set and <varname>max_shared_buffers</varname>
+        will be automatically set to the value of <varname>shared_buffers</varname>
+        at server startup.
+        If this value is specified without units, it is taken as blocks,
+        that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        This parameter determines the amount of memory address space to reserve
+        in each backend for expanding the buffer pool in the future. While the
+        memory for the buffer pool is allocated on demand as it is resized, the
+        memory required to hold the buffer manager metadata is allocated
+        statically at server start, accounting for the largest buffer pool
+        size allowed by this parameter.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/func/func-admin.sgml b/doc/src/sgml/func/func-admin.sgml
index 1b465bc8ba7..0dc89b07c76 100644
--- a/doc/src/sgml/func/func-admin.sgml
+++ b/doc/src/sgml/func/func-admin.sgml
@@ -99,6 +99,63 @@
         <returnvalue>off</returnvalue>
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_resize_shared_buffers</primary>
+        </indexterm>
+        <function>pg_resize_shared_buffers</function> ()
+        <returnvalue>boolean</returnvalue>
+       </para>
+       <para>
+        Dynamically resizes the shared buffer pool to match the current
+        value of the <varname>shared_buffers</varname> parameter. This
+        function implements a coordinated resize process that ensures all
+        backend processes acknowledge the change before completing the
+        operation. The resize happens in multiple phases to maintain
+        data consistency and system stability. Returns <literal>true</literal>
+        if the resize was successful, or raises an error if the operation
+        fails. This function can only be called by superusers.
+       </para>
+       <para>
+        To resize shared buffers, first update the <varname>shared_buffers</varname>
+        setting and reload the configuration, then verify the new value is loaded
+        before calling this function. For example:
+<programlisting>
+postgres=# ALTER SYSTEM SET shared_buffers = '256MB';
+ALTER SYSTEM
+postgres=# SELECT pg_reload_conf();
+ pg_reload_conf
+----------------
+ t
+(1 row)
+
+postgres=# SHOW shared_buffers;
+     shared_buffers      
+-------------------------
+ 128MB (pending: 256MB)
+(1 row)
+
+postgres=# SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers
+--------------------------
+ t
+(1 row)
+
+postgres=# SHOW shared_buffers;
+ shared_buffers
+----------------
+ 256MB
+(1 row)
+</programlisting>
+        The <command>SHOW shared_buffers</command> step is important to verify
+        that the configuration reload was successful and the new value is
+        available to the current session before attempting the resize. The
+        output shows both the current and pending values when a change is waiting
+        to be applied.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 77676d6d035..73df5909886 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -232,7 +232,7 @@ SimpleLruAutotuneBuffers(int divisor, int max)
 {
 	return Min(max - (max % SLRU_BANK_SIZE),
 			   Max(SLRU_BANK_SIZE,
-				   NBuffers / divisor - (NBuffers / divisor) % SLRU_BANK_SIZE));
+				   NBuffersPending / divisor - (NBuffersPending / divisor) % SLRU_BANK_SIZE));
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..f4363e0035d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4676,7 +4676,7 @@ XLOGChooseNumBuffers(void)
 {
 	int			xbuffers;
 
-	xbuffers = NBuffers / 32;
+	xbuffers = NBuffersPending / 32;
 	if (xbuffers > (wal_segment_size / XLOG_BLCKSZ))
 		xbuffers = (wal_segment_size / XLOG_BLCKSZ);
 	if (xbuffers < 8)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index fc8638c1b61..226944e4588 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -335,6 +335,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 
 	InitializeFastPathLocks();
 
+	InitializeMaxNBuffers();
+
 	CreateSharedMemoryAndSemaphores();
 
 	/*
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index cc4b2c80e1a..68de301441b 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,13 +30,19 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/pmsignal.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/pidfile.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -98,6 +104,8 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
+volatile bool delay_shmem_resize = false;
+
 /*
  * Anonymous mapping layout we use looks like this:
  *
@@ -124,6 +132,9 @@ void	   *UsedShmemSegAddr = NULL;
  * being counted against memory limits). The mapping serves as an address space
  * reservation, into which shared memory segment can be extended and is
  * represented by the second /memfd:main with no permissions.
+ *
+ * The reserved space for buffer manager related segments is calculated based on
+ * MaxNBuffers.
  */
 
 /*
@@ -134,6 +145,42 @@ void	   *UsedShmemSegAddr = NULL;
  */
 static bool huge_pages_on = false;
 
+/*
+ * Currently broadcasted value of NBuffers in shared memory.
+ *
+ * Most of the time this value is going to be equal to NBuffers. But if
+ * postmaster is resizing shared memory and a new backend was created
+ * at the same time, there is a possibility for the new backend to inherit the
+ * old NBuffers value, but miss the resize signal if ProcSignal infrastructure
+ * was not initialized yet. Consider this situation:
+ *
+ *     Postmaster ------> New Backend
+ *         |                   |
+ *         |                Launch
+ *         |                   |
+ *         |             Inherit NBuffers
+ *         |                   |
+ *     Resize NBuffers         |
+ *         |                   |
+ *     Emit Barrier            |
+ *         |            Init ProcSignal
+ *         |                   |
+ *     Finish resize           |
+ *         |                   |
+ *     New NBuffers       Old NBuffers
+ *
+ * In this case the backend is not yet ready to receive a signal from
+ * EmitProcSignalBarrier, so the signal will be ignored. The same happens if
+ * ProcSignal is initialized even later, after the resizing has finished.
+ *
+ * To address the resulting inconsistency, the postmaster broadcasts the
+ * current NBuffers value via shared memory. Every new backend has to verify
+ * this value before it accesses the buffer pool: if it differs from its own
+ * value, this indicates a shared memory resize has happened and the backend
+ * has to first synchronize with the rest of the pack.
+ */
+ShmemControl *ShmemCtrl = NULL;
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -156,8 +203,6 @@ MappingName(int shmem_segment)
 			return "iocv";
 		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
 			return "checkpoint";
-		case STRATEGY_SHMEM_SEGMENT:
-			return "strategy";
 		default:
 			return "unknown";
 	}
@@ -921,6 +966,114 @@ AnonymousShmemDetach(int status, Datum arg)
 	}
 }
 
+/*
+ * Resize all shared memory segments based on the new shared_buffers value (saved
+ * in ShmemCtrl area). The actual segment resizing is done via ftruncate, which
+ * will fail if there is not sufficient space to expand the anon file.
+ * 
+ * TODO: Rename this to BufferShmemResize() or something. Only buffer manager's
+ * memory should be resized in this function.
+ * 
+ * TODO: This function changes the amount of shared memory used. So it should
+ * also update the show only GUCs shared_memory_size and
+ * shared_memory_size_in_huge_pages in all backends. SetConfigOption() may be
+ * used for that. But it's not clear whether is_reload parameter is safe to use
+ * while resizing is going on; also at what stage it should be done.
+ */
+bool
+AnonymousShmemResize(void)
+{
+	int		mmap_flags = PG_MMAP_FLAGS;
+	Size 	hugepagesize;
+	MemoryMappingSizes mapping_sizes[NUM_MEMORY_MAPPINGS];
+
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	/* TODO: This is a hack. NBuffersPending should never be written by anything
+	 * other than GUC system. Find a way to pass new NBuffers value to
+	 * BufferManagerShmemSize(). */
+	NBuffersPending = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+	elog(DEBUG1, "Resize shmem from %d to %d", NBuffers, NBuffersPending);
+	
+#ifndef MAP_HUGETLB
+	/* PrepareHugePages should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
+#else
+	if (huge_pages_on)
+	{
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+	}
+#endif
+
+	/* Note that BufferManagerShmemSize() indirectly depends on NBuffersPending. */
+	BufferManagerShmemSize(mapping_sizes);
+
+	for(int i = 0; i < NUM_MEMORY_MAPPINGS; i++)
+	{
+		MemoryMappingSizes *mapping = &mapping_sizes[i];
+		ShmemSegment *segment = &Segments[i];
+		PGShmemHeader *shmem_hdr = segment->ShmemSegHdr;
+
+		/* Main shared memory segment is always static. Ignore it. */
+		if (i == MAIN_SHMEM_SEGMENT)
+			continue;
+
+		round_off_mapping_sizes(mapping);
+		round_off_mapping_sizes_for_hugepages(mapping, hugepagesize);
+
+		/*
+		 * Size of the reserved address space should not change, since it depends
+		 * upon MaxNBuffers, which can be changed only on restart.
+		 */
+		Assert(segment->shmem_reserved == mapping->shmem_reserved);
+#ifdef MAP_HUGETLB
+		if (huge_pages_on && (mapping_sizes->shmem_req_size % hugepagesize != 0))
+			mapping_sizes->shmem_req_size += hugepagesize - (mapping_sizes->shmem_req_size % hugepagesize);
+#endif
+		elog(DEBUG1, "segment[%s]: requested size %zu, current size %zu, reserved %zu",
+			 MappingName(i), mapping->shmem_req_size, segment->shmem_size,
+			 segment->shmem_reserved);
+
+		if (segment->shmem == NULL)
+			continue;
+
+		if (segment->shmem_size == mapping->shmem_req_size)
+			continue;
+
+		/*
+		 * We should have reserved enough address space for resizing. PANIC if
+		 * that's not the case.
+		 */
+		if (segment->shmem_reserved < mapping->shmem_req_size)
+			ereport(PANIC,
+					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					 errmsg("not enough shared memory is reserved")));
+
+		elog(DEBUG1, "segment[%s]: resize from %zu to %zu at address %p",
+					 MappingName(i), segment->shmem_size,
+					 mapping->shmem_req_size, segment->shmem);
+
+		/*
+		 * Resize the backing file to resize the allocated memory, and allocate
+		 * more memory on supported platforms if required.
+		 */
+		if(ftruncate(segment->segment_fd, mapping->shmem_req_size) == -1)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYSTEM_ERROR),
+					 errmsg("could not truncate anonymous file for \"%s\": %m",
+							MappingName(i))));
+		if (mapping->shmem_req_size > segment->shmem_size)
+			shmem_fallocate(segment->segment_fd, MappingName(i), mapping->shmem_req_size, ERROR);
+
+		segment->shmem_size = mapping->shmem_req_size;
+		shmem_hdr->totalsize = segment->shmem_size;
+		segment->ShmemEnd = segment->shmem + segment->shmem_size;
+	}
+
+	return true;
+}
+
 /*
  * PGSharedMemoryCreate
  *
@@ -1224,3 +1377,22 @@ PGSharedMemoryDetach(void)
 		}
 	}
 }
+
+void
+ShmemControlInit(void)
+{
+	bool foundShmemCtrl;
+
+	ShmemCtrl = (ShmemControl *)
+	ShmemInitStruct("Shmem Control", sizeof(ShmemControl),
+									 &foundShmemCtrl);
+
+	if (!foundShmemCtrl)
+	{
+		pg_atomic_init_u32(&ShmemCtrl->targetNBuffers, 0);
+		pg_atomic_init_u32(&ShmemCtrl->currentNBuffers, 0);
+		pg_atomic_init_flag(&ShmemCtrl->resize_in_progress);
+
+		ShmemCtrl->coordinator = 0;
+	}
+}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e84e8663e96..ef3f84a55f5 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -654,9 +654,12 @@ CheckpointerMain(const void *startup_data, size_t startup_data_len)
 static void
 ProcessCheckpointerInterrupts(void)
 {
-	if (ProcSignalBarrierPending)
-		ProcessProcSignalBarrier();
-
+	/*
+	 * Reloading the config can trigger further signals, complicating
+	 * interrupt processing -- so let it run first.
+	 *
+	 * XXX: Is there any need for a memory barrier after ProcessConfigFile?
+	 */
 	if (ConfigReloadPending)
 	{
 		ConfigReloadPending = false;
@@ -676,6 +679,9 @@ ProcessCheckpointerInterrupts(void)
 		UpdateSharedMemoryConfig();
 	}
 
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
 		ProcessLogMemoryContextInterrupt();
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7c064cf9fbb..2095713d7c0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -110,11 +110,15 @@
 #include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/aio_subsys.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
 #include "tcop/backend_startup.h"
 #include "tcop/tcopprot.h"
 #include "utils/datetime.h"
@@ -125,7 +129,6 @@
 
 #ifdef EXEC_BACKEND
 #include "common/file_utils.h"
-#include "storage/pg_shmem.h"
 #endif
 
 
@@ -958,6 +961,11 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	InitializeFastPathLocks();
 
+	/*
+	 * Calculate MaxNBuffers for buffer pool resizing.
+	 */
+	InitializeMaxNBuffers();
+
 	/*
 	 * Give preloaded libraries a chance to request additional shared memory.
 	 */
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index fd7c40dcb08..3bc9aee85de 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -17,6 +17,7 @@ OBJS = \
 	buf_table.o \
 	bufmgr.o \
 	freelist.o \
-	localbuf.o
+	localbuf.o \
+	buf_resize.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 4fa547f48de..4a354107185 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -17,7 +17,7 @@
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
-#include "storage/pg_shmem.h"
+#include "utils/guc.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -62,11 +62,12 @@ CkptSortItem *CkptBufferIds;
 /*
  * Initialize shared buffer pool
  *
- * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend). Size of data structures initialized
- * here depends on NBuffers, and to be able to change NBuffers without a
- * restart we store each structure into a separate shared memory segment, which
- * could be resized on demand.
+ * This is called once during shared-memory initialization.
+ * TODO: Restore this function to its initial form. This function should see
+ * no change in the buffer resize patches, except maybe the use of NBuffersPending.
+ *
+ * No locks are taken in this function; it is the caller's responsibility to
+ * make sure only one backend can work with the new buffers.
  */
 void
 BufferManagerShmemInit(void)
@@ -75,24 +76,25 @@ BufferManagerShmemInit(void)
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
+	int			i;
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
 		ShmemInitStructInSegment("Buffer Descriptors",
-						NBuffers * sizeof(BufferDescPadded),
+						NBuffersPending * sizeof(BufferDescPadded),
 						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
 				  ShmemInitStructInSegment("Buffer Blocks",
-								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  NBuffersPending * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
 								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
 		ShmemInitStructInSegment("Buffer IO Condition Variables",
-						NBuffers * sizeof(ConditionVariableMinimallyPadded),
+						NBuffersPending * sizeof(ConditionVariableMinimallyPadded),
 						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
@@ -104,48 +106,54 @@ BufferManagerShmemInit(void)
 	 */
 	CkptBufferIds = (CkptSortItem *)
 		ShmemInitStructInSegment("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						NBuffersPending * sizeof(CkptSortItem), &foundBufCkpt,
 						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
 		/* should find all of these, or none of them */
 		Assert(foundDescs && foundBufs && foundIOCV && foundBufCkpt);
-		/* note: this path is only taken in EXEC_BACKEND case */
-	}
-	else
-	{
-		int			i;
-
 		/*
-		 * Initialize all the buffer headers.
+		 * note: this path is only taken in EXEC_BACKEND case when initializing
+		 * shared memory.
 		 */
-		for (i = 0; i < NBuffers; i++)
-		{
-			BufferDesc *buf = GetBufferDescriptor(i);
+	}
 
-			ClearBufferTag(&buf->tag);
+	/*
+	 * Initialize all the buffer headers.
+	 */
+	for (i = 0; i < NBuffersPending; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
+
+		ClearBufferTag(&buf->tag);
 
-			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
 
-			buf->buf_id = i;
+		buf->buf_id = i;
 
-			pgaio_wref_clear(&buf->io_wref);
+		pgaio_wref_clear(&buf->io_wref);
 
-			LWLockInitialize(BufferDescriptorGetContentLock(buf),
-							 LWTRANCHE_BUFFER_CONTENT);
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
 
-			ConditionVariableInit(BufferDescriptorGetIOCV(buf));
-		}
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
 	}
 
-	/* Init other shared buffer-management stuff */
+	/*
+	 * Init other shared buffer-management stuff.
+	 */
 	StrategyInitialize(!foundDescs);
 
 	/* Initialize per-backend file flush context */
 	WritebackContextInit(&BackendWritebackContext,
 						 &backend_flush_after);
+	
+	/* Declare the size of current buffer pool. */
+	NBuffers = NBuffersPending;
+	pg_atomic_write_u32(&ShmemCtrl->currentNBuffers, NBuffers);
+	pg_atomic_write_u32(&ShmemCtrl->targetNBuffers, NBuffers);
 }
 
 /*
@@ -156,6 +164,8 @@ BufferManagerShmemInit(void)
  * shared memory segment. The main segment must not allocate anything
  * related to buffers, every other segment will receive part of the
  * data.
+ * 
+ * Also sets the shmem_reserved field for each segment based on MaxNBuffers.
  */
 Size
 BufferManagerShmemSize(MemoryMappingSizes *mapping_sizes)
@@ -163,31 +173,222 @@ BufferManagerShmemSize(MemoryMappingSizes *mapping_sizes)
 	size_t size;
 
 	/* size of buffer descriptors, plus alignment padding */
-	size = add_size(0, mul_size(NBuffers, sizeof(BufferDescPadded)));
+	size = add_size(0, mul_size(NBuffersPending, sizeof(BufferDescPadded)));
 	size = add_size(size, PG_CACHE_LINE_SIZE);
 	mapping_sizes[BUFFER_DESCRIPTORS_SHMEM_SEGMENT].shmem_req_size = size;
+	size = add_size(0, mul_size(MaxNBuffers, sizeof(BufferDescPadded)));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	mapping_sizes[BUFFER_DESCRIPTORS_SHMEM_SEGMENT].shmem_reserved = size;
 
 	/* size of data pages, plus alignment padding */
 	size = add_size(0, PG_IO_ALIGN_SIZE);
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	size = add_size(size, mul_size(NBuffersPending, BLCKSZ));
 	mapping_sizes[BUFFERS_SHMEM_SEGMENT].shmem_req_size = size;
+	size = add_size(0, PG_IO_ALIGN_SIZE);
+	size = add_size(size, mul_size(MaxNBuffers, BLCKSZ));
 	mapping_sizes[BUFFERS_SHMEM_SEGMENT].shmem_reserved = size;
 
-	/* size of stuff controlled by freelist.c */
-	mapping_sizes[STRATEGY_SHMEM_SEGMENT].shmem_req_size = StrategyShmemSize();
-	mapping_sizes[STRATEGY_SHMEM_SEGMENT].shmem_reserved = StrategyShmemSize();
-
 	/* size of I/O condition variables, plus alignment padding */
-	size = add_size(0, mul_size(NBuffers,
+	size = add_size(0, mul_size(NBuffersPending,
 								   sizeof(ConditionVariableMinimallyPadded)));
 	size = add_size(size, PG_CACHE_LINE_SIZE);
 	mapping_sizes[BUFFER_IOCV_SHMEM_SEGMENT].shmem_req_size = size;
+	size = add_size(0, mul_size(MaxNBuffers,
+								   sizeof(ConditionVariableMinimallyPadded)));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	mapping_sizes[BUFFER_IOCV_SHMEM_SEGMENT].shmem_reserved = size;
 
 	/* size of checkpoint sort array in bufmgr.c */
-	mapping_sizes[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_req_size = mul_size(NBuffers, sizeof(CkptSortItem));
-	mapping_sizes[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_reserved = mul_size(NBuffers, sizeof(CkptSortItem));
+	mapping_sizes[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_req_size = mul_size(NBuffersPending, sizeof(CkptSortItem));
+	mapping_sizes[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_reserved = mul_size(MaxNBuffers, sizeof(CkptSortItem));
+
+	/* Allocations in the main memory segment, at the end. */
+
+	/* size of stuff controlled by freelist.c */
+	size = add_size(0, StrategyShmemSize());
 
 	return size;
 }
+
+/*
+ * Reinitialize shared buffer manager structures when resizing the buffer pool.
+ *
+ * This function is called in the backend which coordinates buffer resizing
+ * operation.
+ *
+ * TODO: Avoid code duplication with BufferManagerShmemInit() and also assess
+ * which functionality in the latter is required in this function.
+ */
+void
+BufferManagerShmemResize(int currentNBuffers, int targetNBuffers)
+{
+	bool found;
+	int			i;
+	void *tmpPtr;
+
+	tmpPtr = (BufferDescPadded *)
+		ShmemUpdateStructInSegment("Buffer Descriptors",
+						targetNBuffers * sizeof(BufferDescPadded),
+						&found, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
+	if (BufferDescriptors != tmpPtr || !found)
+		elog(FATAL, "resizing buffer descriptors failed: expected pointer %p, got %p, found=%d",
+			 BufferDescriptors, tmpPtr, found);
+
+	tmpPtr = (ConditionVariableMinimallyPadded *)
+		ShmemUpdateStructInSegment("Buffer IO Condition Variables",
+						targetNBuffers * sizeof(ConditionVariableMinimallyPadded),
+						&found, BUFFER_IOCV_SHMEM_SEGMENT);
+	if (BufferIOCVArray != tmpPtr || !found)
+		elog(FATAL, "resizing buffer IO condition variables failed: expected pointer %p, got %p, found=%d",
+			 BufferIOCVArray, tmpPtr, found);
+
+	tmpPtr = (CkptSortItem *)
+		ShmemUpdateStructInSegment("Checkpoint BufferIds",
+						targetNBuffers * sizeof(CkptSortItem), &found,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
+	if (CkptBufferIds != tmpPtr || !found)
+		elog(FATAL, "resizing checkpoint buffer IDs failed: expected pointer %p, got %p, found=%d",
+			 CkptBufferIds, tmpPtr, found);
+
+	tmpPtr = (char *)
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemUpdateStructInSegment("Buffer Blocks",
+								  targetNBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &found, BUFFERS_SHMEM_SEGMENT));
+	if (BufferBlocks != tmpPtr || !found)
+		elog(FATAL, "resizing buffer blocks failed: expected pointer %p, got %p, found=%d",
+			 BufferBlocks, tmpPtr, found);
+
+	/*
+	 * Initialize the headers for new buffers. If we are shrinking the buffer
+	 * pool, currentNBuffers > targetNBuffers, so this loop doesn't execute.
+	 */
+	for (i = currentNBuffers; i < targetNBuffers; i++)
+	{
+		BufferDesc *buf = GetBufferDescriptor(i);
+
+		ClearBufferTag(&buf->tag);
+
+		pg_atomic_init_u32(&buf->state, 0);
+		buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+
+		buf->buf_id = i;
+
+		LWLockInitialize(BufferDescriptorGetContentLock(buf),
+						 LWTRANCHE_BUFFER_CONTENT);
+
+		ConditionVariableInit(BufferDescriptorGetIOCV(buf));
+	}
+
+	/*
+	 * We do not touch StrategyControl here. Instead it is done by the
+	 * background writer when handling the PROCSIGNAL_BARRIER_SHBUF_EXPAND or
+	 * PROCSIGNAL_BARRIER_SHBUF_SHRINK barrier.
+	 */
+}
+
+/*
+ * BufferManagerShmemValidate
+ *		Validate that buffer manager shared memory structures have correct
+ *		pointers and sizes after a resize operation.
+ *
+ * This function is called by backends during ProcessBarrierShmemResizeStruct
+ * to ensure their view of the buffer structures is consistent after memory
+ * remapping.
+ */
+void
+BufferManagerShmemValidate(int targetNBuffers)
+{
+	bool found;
+	void *tmpPtr;
+
+	/* Validate Buffer Descriptors */
+	tmpPtr = (BufferDescPadded *)
+		ShmemInitStructInSegment("Buffer Descriptors",
+						targetNBuffers * sizeof(BufferDescPadded),
+						&found, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
+	if (!found || BufferDescriptors != tmpPtr)
+		elog(FATAL, "validating buffer descriptors failed: expected pointer %p, got %p, found=%d",
+			 BufferDescriptors, tmpPtr, found);
+
+	/* Validate Buffer IO Condition Variables */
+	tmpPtr = (ConditionVariableMinimallyPadded *)
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
+						targetNBuffers * sizeof(ConditionVariableMinimallyPadded),
+						&found, BUFFER_IOCV_SHMEM_SEGMENT);
+	if (!found || BufferIOCVArray != tmpPtr)
+		elog(FATAL, "validating buffer IO condition variables failed: expected pointer %p, got %p, found=%d",
+			 BufferIOCVArray, tmpPtr, found);
+
+	/* Validate Checkpoint BufferIds */
+	tmpPtr = (CkptSortItem *)
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						targetNBuffers * sizeof(CkptSortItem), &found,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
+	if (!found || CkptBufferIds != tmpPtr)
+		elog(FATAL, "validating checkpoint buffer IDs failed: expected pointer %p, got %p, found=%d",
+			 CkptBufferIds, tmpPtr, found);
+
+	/* Validate Buffer Blocks */
+	tmpPtr = (char *)
+		TYPEALIGN(PG_IO_ALIGN_SIZE,
+				  ShmemInitStructInSegment("Buffer Blocks",
+								  targetNBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  &found, BUFFERS_SHMEM_SEGMENT));
+	if (!found || BufferBlocks != tmpPtr)
+		elog(FATAL, "validating buffer blocks failed: expected pointer %p, got %p, found=%d",
+			 BufferBlocks, tmpPtr, found);
+}
+
+/*
+ * check_shared_buffers
+ *		GUC check_hook for shared_buffers
+ *
+ * When reloading the configuration, shared_buffers must not be set to a value
+ * higher than max_shared_buffers, which is fixed at server start.
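+ *
+ * For example (a sketch of the intended behaviour): with max_shared_buffers
+ * set to '1GB' at server start, a configuration reload that sets
+ * shared_buffers to '2GB' is rejected here and the previous value is kept.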
+ */
+bool
+check_shared_buffers(int *newval, void **extra, GucSource source)
+{
+	if (finalMaxNBuffers && *newval > MaxNBuffers)
+	{
+		GUC_check_errdetail("\"shared_buffers\" must not be greater than \"max_shared_buffers\".");
+		return false;
+	}
+	return true;
+}
+
+/*
+ * show_shared_buffers
+ *		GUC show_hook for shared_buffers
+ *
+ * Shows both current and pending buffer counts with proper unit formatting.
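+ *
+ * For example, with 128MB applied and a resize to 512MB loaded via
+ * pg_reload_conf() but not yet performed, this shows "128MB (pending: 512MB)";
+ * once the resize completes it shows just "512MB".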
+ */
+const char *
+show_shared_buffers(void)
+{
+	static char buffer[128];
+	int64 current_value, pending_value;
+	const char *current_unit, *pending_unit;
+	int currentNBuffers = pg_atomic_read_u32(&ShmemCtrl->currentNBuffers);
+
+	if (currentNBuffers == NBuffersPending)
+	{
+		/* No buffer pool resizing pending. */
+		convert_int_from_base_unit(currentNBuffers, GUC_UNIT_BLOCKS, &current_value, &current_unit);
+		snprintf(buffer, sizeof(buffer), INT64_FORMAT "%s", current_value, current_unit);
+	}
+	else
+	{
+		/*
+		 * New value for NBuffers is loaded but not applied yet, show both
+		 * current and pending.
+		 */
+		convert_int_from_base_unit(currentNBuffers, GUC_UNIT_BLOCKS, &current_value, &current_unit);
+		convert_int_from_base_unit(NBuffersPending, GUC_UNIT_BLOCKS, &pending_value, &pending_unit);
+		snprintf(buffer, sizeof(buffer), INT64_FORMAT "%s (pending: " INT64_FORMAT "%s)", 
+				 current_value, current_unit, pending_value, pending_unit);
+	}
+	
+	return buffer;
+}
diff --git a/src/backend/storage/buffer/buf_resize.c b/src/backend/storage/buffer/buf_resize.c
new file mode 100644
index 00000000000..e815600c3ba
--- /dev/null
+++ b/src/backend/storage/buffer/buf_resize.c
@@ -0,0 +1,399 @@
+/*-------------------------------------------------------------------------
+ *
+ * buf_resize.c
+ *	  shared buffer pool resizing functionality
+ *
+ * This module contains the implementation of shared buffer pool resizing,
+ * including the main resize coordination function and barrier processing
+ * functions that synchronize all backends during resize operations.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/buf_resize.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/ipc.h"
+#include "storage/pg_shmem.h"
+#include "storage/pmsignal.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "utils/injection_point.h"
+
+
+/*
+ * Prepare ShmemCtrl for resizing the shared buffer pool.
+ */
+static void
+MarkBufferResizingStart(int targetNBuffers, int currentNBuffers)
+{
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	Assert(pg_atomic_read_u32(&ShmemCtrl->currentNBuffers) == currentNBuffers);
+
+	pg_atomic_write_u32(&ShmemCtrl->targetNBuffers, targetNBuffers);
+	ShmemCtrl->coordinator = MyProcPid;
+}
+
+/*
+ * Reset ShmemCtrl after resizing the shared buffer pool is done.
+ */
+static void
+MarkBufferResizingEnd(int NBuffers)
+{	
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	Assert(pg_atomic_read_u32(&ShmemCtrl->currentNBuffers) == NBuffers);
+	pg_atomic_write_u32(&ShmemCtrl->targetNBuffers, 0);
+	ShmemCtrl->coordinator = -1;
+}
+
+/*
+ * Communicate given buffer pool resize barrier to all other backends and the Postmaster.
+ *
+ * ProcSignalBarrier is not sent to the Postmaster but we need the Postmaster to
+ * update its knowledge about the buffer pool so that it can be inherited by the
+ * child processes.
+ */
+static void
+SharedBufferResizeBarrier(ProcSignalBarrierType barrier, const char *barrier_name)
+{
+	WaitForProcSignalBarrier(EmitProcSignalBarrier(barrier));
+	elog(LOG, "all backends acknowledged %s barrier", barrier_name);
+	
+#ifdef USE_INJECTION_POINTS
+	/* Injection point specific to this barrier type */
+	switch (barrier)
+	{
+		case PROCSIGNAL_BARRIER_SHBUF_SHRINK:
+			INJECTION_POINT("pgrsb-shrink-barrier-sent", NULL);
+			break;
+		case PROCSIGNAL_BARRIER_SHBUF_RESIZE_MAP_AND_MEM:
+			INJECTION_POINT("pgrsb-resize-barrier-sent", NULL);
+			break;
+		case PROCSIGNAL_BARRIER_SHBUF_EXPAND:
+			INJECTION_POINT("pgrsb-expand-barrier-sent", NULL);
+			break;
+		case PROCSIGNAL_BARRIER_SHBUF_RESIZE_FAILED:
+			/* TODO: Add an injection point here. */
+			break;
+		case PROCSIGNAL_BARRIER_SMGRRELEASE:
+			/*
+			 * Not relevant in this function, but listed so that the compiler
+			 * can detect any missing barrier enum values in this switch.
+			 */
+			break;
+	}
+#endif /* USE_INJECTION_POINTS */
+}
+
+/*
+ * C implementation of the SQL interface to update the shared buffer pool
+ * according to the current value of the shared_buffers GUC.
+ *
+ * The current boundaries of the buffer pool are given by two ranges.
+ *
+ * - [1, StrategyControl::activeNBuffers] is the range of buffers from which new
+ * allocations can happen at any time.
+ *
+ * - [1, ShmemCtrl::currentNBuffers] is the range of valid buffers at any given
+ * time.
+ *
+ * Let's assume that before resizing, the number of buffers in the buffer pool is
+ * NBuffersOld. After resizing it is NBuffersNew. Before resizing
+ * StrategyControl::activeNBuffers == ShmemCtrl::currentNBuffers == NBuffersOld.
+ * After the resizing finishes StrategyControl::activeNBuffers ==
+ * ShmemCtrl::currentNBuffers == NBuffersNew. Thus, when no resizing is in
+ * progress, these two ranges are the same.
+ *
+ * The following steps are performed by the coordinator during resizing.
+ *
+ * 1. Marks resizing in progress to avoid multiple concurrent invocations of this
+ * function.
+ *
+ * 2. When shrinking the shared buffer pool, the coordinator sends the
+ * SHBUF_SHRINK ProcSignalBarrier. In response to this barrier the background
+ * writer is expected to set StrategyControl::activeNBuffers = NBuffersNew to
+ * restrict new buffer allocations to the new buffer pool size, and also to
+ * reset its internal state. Once every backend has acknowledged the barrier,
+ * the coordinator can be sure that new allocations will not happen in the
+ * buffer pool area being shrunk. Then it evicts the buffers in that area.
+ * Note that ShmemCtrl::currentNBuffers is still NBuffersOld, since backends
+ * may still access buffers allocated before the resizing started. Buffer
+ * eviction may fail if a buffer being evicted is pinned, in which case the
+ * resizing operation is aborted. Once the eviction is finished, the extra
+ * memory can be freed in the next step.
+ *
+ * 3. This step is executed in both cases, whether expanding or shrinking the
+ * buffer pool. The anonymous file backing each of the shared memory segments
+ * containing the buffer pool shared data structures is resized to the amount
+ * of memory required for the new buffer pool size. When expanding, the
+ * expanded portion of memory is initialized appropriately.
+ * ShmemCtrl::currentNBuffers is set to NBuffersNew to indicate the new range
+ * of valid shared buffers. Every backend is sent the SHBUF_RESIZE_MAP_AND_MEM
+ * barrier. All the backends validate that their pointers to the shared buffer
+ * structures are valid and have the right size. Once every backend has
+ * acknowledged the barrier, this step finishes.
+ *
+ * 4. When expanding the buffer pool, the coordinator sends the SHBUF_EXPAND
+ * barrier to signal the end of expansion. In response, the background writer
+ * sets StrategyControl::activeNBuffers = NBuffersNew so that new allocations
+ * can use the expanded range of the buffer pool.
+ *
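+ * A typical invocation, assuming the workflow intended by this patch set
+ * (a sketch, not a final UI):
+ *
+ *     ALTER SYSTEM SET shared_buffers = '512MB';
+ *     SELECT pg_reload_conf();            -- new value becomes NBuffersPending
+ *     SELECT pg_resize_shared_buffers();  -- applies the pending value
+ *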
+ * TODO: Handle the case when the backend executing this function dies or the
+ * query is cancelled or it hits an error while resizing.
+ */
+Datum
+pg_resize_shared_buffers(PG_FUNCTION_ARGS)
+{
+	bool result = true;
+	int currentNBuffers = pg_atomic_read_u32(&ShmemCtrl->currentNBuffers);
+	int targetNBuffers = NBuffersPending;
+
+	if (currentNBuffers == targetNBuffers)
+	{
+		elog(LOG, "shared buffers are already at %d, no need to resize", currentNBuffers);
+		PG_RETURN_BOOL(true);
+	}
+
+	if (!pg_atomic_test_set_flag(&ShmemCtrl->resize_in_progress))
+	{
+		elog(LOG, "shared buffer resizing already in progress");
+		PG_RETURN_BOOL(false);
+	}
+
+	/*
+	 * TODO: What if the NBuffersPending value seen here is not the desired one
+	 * because somebody did another pg_reload_conf() with a different
+	 * shared_buffers value before this function was executed?
+	 */
+	MarkBufferResizingStart(targetNBuffers, currentNBuffers);
+	elog(LOG, "resizing shared buffers from %d to %d", currentNBuffers, targetNBuffers);
+
+	INJECTION_POINT("pg-resize-shared-buffers-flag-set", NULL);
+
+	/* Phase 1: SHBUF_SHRINK - Only for shrinking buffer pool */
+	if (targetNBuffers < currentNBuffers)
+	{
+		/*
+		 * Phase 1: Shrinking - send the SHBUF_SHRINK barrier. The background
+		 * writer sets activeNBuffers = targetNBuffers to restrict buffer pool
+		 * allocations to the new size.
+		 */
+		elog(LOG, "Phase 1: Shrinking buffer pool, restricting allocations to %d buffers", targetNBuffers);
+		
+		SharedBufferResizeBarrier(PROCSIGNAL_BARRIER_SHBUF_SHRINK, CppAsString(PROCSIGNAL_BARRIER_SHBUF_SHRINK));
+
+		/* Evict buffers in the area being shrunk */
+		elog(LOG, "evicting buffers %u..%u", targetNBuffers + 1, currentNBuffers);
+		if (!EvictExtraBuffers(targetNBuffers, currentNBuffers))
+		{
+			elog(WARNING, "failed to evict extra buffers during shrinking");
+			SharedBufferResizeBarrier(PROCSIGNAL_BARRIER_SHBUF_RESIZE_FAILED, CppAsString(PROCSIGNAL_BARRIER_SHBUF_RESIZE_FAILED));
+			MarkBufferResizingEnd(currentNBuffers);
+			pg_atomic_clear_flag(&ShmemCtrl->resize_in_progress);
+			PG_RETURN_BOOL(false);
+		}
+
+		/* Update the current NBuffers. */
+		pg_atomic_write_u32(&ShmemCtrl->currentNBuffers, targetNBuffers);
+	}
+
+	/* Phase 2: SHBUF_RESIZE_MAP_AND_MEM - Both expanding and shrinking */
+	elog(LOG, "Phase 2: Remapping shared memory segments and updating structures");
+	if (!AnonymousShmemResize())
+	{
+		/*
+		 * This should never fail since address map should already be reserved.
+		 * So the failure should be treated as PANIC.
+		 */
+		elog(PANIC, "failed to resize anonymous shared memory");
+	}
+
+	/* Update structure pointers and sizes */
+	BufferManagerShmemResize(currentNBuffers, targetNBuffers);
+
+	INJECTION_POINT("pgrsb-after-shmem-resize", NULL);
+
+	SharedBufferResizeBarrier(PROCSIGNAL_BARRIER_SHBUF_RESIZE_MAP_AND_MEM, CppAsString(PROCSIGNAL_BARRIER_SHBUF_RESIZE_MAP_AND_MEM));
+
+	/* Phase 3: SHBUF_EXPAND - Only for expanding buffer pool */
+	if (targetNBuffers > currentNBuffers)
+	{
+		/*
+		 * Phase 3: Expanding - send the SHBUF_EXPAND barrier. The background
+		 * writer sets activeNBuffers = targetNBuffers so that backends start
+		 * allocating buffers from the expanded range.
+		 */
+		elog(LOG, "Phase 3: Expanding buffer pool, enabling allocations up to %d buffers", targetNBuffers);
+		pg_atomic_write_u32(&ShmemCtrl->currentNBuffers, targetNBuffers);
+		
+		SharedBufferResizeBarrier(PROCSIGNAL_BARRIER_SHBUF_EXPAND, CppAsString(PROCSIGNAL_BARRIER_SHBUF_EXPAND));
+	}
+
+	/*
+	 * Reset buffer resize control area.
+	 */
+	MarkBufferResizingEnd(targetNBuffers);
+
+	pg_atomic_clear_flag(&ShmemCtrl->resize_in_progress);
+
+	elog(LOG, "successfully resized shared buffers to %d", targetNBuffers);
+
+	PG_RETURN_BOOL(result);
+}
+
+bool
+ProcessBarrierShmemShrink(void)
+{
+	int targetNBuffers = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	/*
+	 * Delay adjusting the new active size of buffer pool till this process
+	 * becomes ready to resize buffers.
+	 */
+	if (delay_shmem_resize)
+	{
+		elog(LOG, "Phase 1: Delaying SHBUF_SHRINK barrier - restricting allocations to %d buffers, coordinator is %d",
+			targetNBuffers, ShmemCtrl->coordinator);
+
+		return false;
+	}
+
+	if (MyBackendType == B_BG_WRITER)
+	{
+		/*
+		 * We have to reset the background writer's buffer allocation statistics
+		 * and the strategy control together so that the background writer
+		 * doesn't go out of sync with ClockSweepTick().
+		 *
+		 * TODO: But in case the background writer is not running, nobody would
+		 * reset the strategy control area, so we can't rely on the background
+		 * writer to do that. Find a better way.
+		 */
+		BgBufferSyncReset(NBuffers, targetNBuffers);
+		/* Reset strategy control to new size */
+		StrategyReset(targetNBuffers);
+	}
+
+	elog(LOG, "Phase 1: Processing SHBUF_SHRINK barrier - NBuffers = %d, coordinator is %d",
+		 NBuffers, ShmemCtrl->coordinator);
+
+	return true;
+}
+
+bool
+ProcessBarrierShmemResizeMapAndMem(void)
+{
+	int targetNBuffers = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	/*
+	 * If the buffer pool is being shrunk, we are already working with a smaller
+	 * buffer pool, so shrinking the address space and shared structures should
+	 * not be a problem. When expanding, growing the address space and shared
+	 * structures beyond the current boundaries is not a problem either, since
+	 * we are not accessing that memory yet. So there is no reason to delay
+	 * processing this barrier.
+	 */
+
+	/*
+	 * Coordinator has already adjusted its address map and also updated sizes
+	 * of the shared buffer structures, no further validation needed.
+	 */
+	if (ShmemCtrl->coordinator == MyProcPid)
+		return true;
+
+	/*
+	 * Backends validate that their pointers to shared buffer structures are
+	 * still valid and have the correct size after memory remapping.
+	 *
+	 * TODO: Do we want to do this only in assert-enabled builds?
+	 */
+	BufferManagerShmemValidate(targetNBuffers);
+	
+	elog(LOG, "Backend %d successfully validated structure pointers after resize", MyProcPid);
+
+	return true;
+}
+
+bool
+ProcessBarrierShmemExpand(void)
+{
+	int targetNBuffers = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	/*
+	 * Delay adjusting the new active size of buffer pool till this process
+	 * becomes ready to resize buffers.
+	 */
+	if (delay_shmem_resize)
+	{
+		elog(LOG, "Phase 3: delaying SHBUF_EXPAND barrier - enabling allocations up to %d buffers, coordinator is %d",
+				targetNBuffers, ShmemCtrl->coordinator);
+		return false;
+	}
+
+	if (MyBackendType == B_BG_WRITER)
+	{
+		/*
+		 * We have to reset the background writer's buffer allocation statistics
+		 * and the strategy control together so that the background writer
+		 * doesn't go out of sync with ClockSweepTick().
+		 *
+		 * TODO: But in case the background writer is not running, nobody would
+		 * reset the strategy control area, so we can't rely on the background
+		 * writer to do that. Find a better way.
+		 */
+		BgBufferSyncReset(NBuffers, targetNBuffers);
+		StrategyReset(targetNBuffers);
+	}
+
+	elog(LOG, "Phase 3: Processing SHBUF_EXPAND barrier - targetNBuffers = %d, ShmemCtrl->coordinator = %d", targetNBuffers, ShmemCtrl->coordinator);
+
+	return true;
+}
+
+bool
+ProcessBarrierShmemResizeFailed(void)
+{
+	int currentNBuffers = pg_atomic_read_u32(&ShmemCtrl->currentNBuffers);
+	int targetNBuffers = pg_atomic_read_u32(&ShmemCtrl->targetNBuffers);
+
+	Assert(!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress));
+
+	if (MyBackendType == B_BG_WRITER)
+	{
+		/*
+		 * We have to reset the background writer's buffer allocation statistics
+		 * and the strategy control together so that the background writer
+		 * doesn't go out of sync with ClockSweepTick().
+		 *
+		 * TODO: But in case the background writer is not running, nobody would
+		 * reset the strategy control area, so we can't rely on the background
+		 * writer to do that. Find a better way.
+		 */
+		BgBufferSyncReset(NBuffers, currentNBuffers);
+		/* Reset strategy control to new size */
+		StrategyReset(currentNBuffers);
+	}
+
+	elog(LOG, "received proc signal indicating failure to resize shared buffers from %d to %d, restoring to %d, coordinator is %d",
+			NBuffers, targetNBuffers, currentNBuffers, ShmemCtrl->coordinator);
+
+	return true;
+}
\ No newline at end of file
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 67e87f9935d..18c9c6f336c 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -65,11 +65,18 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
+	/*
+	 * The shared buffer lookup table is set up only once, with the maximum
+	 * possible number of entries based on the maximum size of the buffer pool.
+	 * It is not resized after that, even if the buffer pool is resized. Hence
+	 * it is allocated in the main shared memory segment and not in a
+	 * resizeable shared memory segment.
+	 */
 	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
 								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION | HASH_FIXED_SIZE,
-								  STRATEGY_SHMEM_SEGMENT);
+								  MAIN_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 327ddb7adc8..6c8f8552a4c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/read_stream.h"
 #include "storage/smgr.h"
@@ -3607,6 +3608,32 @@ BufferSync(int flags)
 	TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
 }
 
+/*
+ * Information saved between BgBufferSync() calls so we can determine the
+ * strategy point's advance rate and avoid scanning already-cleaned buffers. The
+ * variables are at file scope rather than local to BgBufferSync() so that
+ * BgBufferSyncReset() can reset them when resizing shared buffers.
+ */
+static bool saved_info_valid = false;
+static int	prev_strategy_buf_id;
+static uint32 prev_strategy_passes;
+static int	next_to_clean;
+static uint32 next_passes;
+
+/* Moving averages of allocation rate and clean-buffer density */
+static float smoothed_alloc = 0;
+static float smoothed_density = 10.0;
+
+void
+BgBufferSyncReset(int currentNBuffers, int targetNBuffers)
+{
+	saved_info_valid = false;
+#ifdef BGW_DEBUG
+	elog(DEBUG2, "invalidated background writer status after resizing buffers from %d to %d",
+		 currentNBuffers, targetNBuffers);
+#endif
+}
+
 /*
  * BgBufferSync -- Write out some dirty buffers in the pool.
  *
@@ -3626,20 +3653,6 @@ BgBufferSync(WritebackContext *wb_context)
 	uint32		strategy_passes;
 	uint32		recent_alloc;
 
-	/*
-	 * Information saved between calls so we can determine the strategy
-	 * point's advance rate and avoid scanning already-cleaned buffers.
-	 */
-	static bool saved_info_valid = false;
-	static int	prev_strategy_buf_id;
-	static uint32 prev_strategy_passes;
-	static int	next_to_clean;
-	static uint32 next_passes;
-
-	/* Moving averages of allocation rate and clean-buffer density */
-	static float smoothed_alloc = 0;
-	static float smoothed_density = 10.0;
-
 	/* Potentially these could be tunables, but for now, not */
 	float		smoothing_samples = 16;
 	float		scan_whole_pool_milliseconds = 120000.0;
@@ -3662,6 +3675,25 @@ BgBufferSync(WritebackContext *wb_context)
 	long		new_strategy_delta;
 	uint32		new_recent_alloc;
 
+	/*
+	 * If the buffer pool is being shrunk, the buffer being written out may not
+	 * remain valid. If the buffer pool is being expanded, more buffers will
+	 * become available without this function writing out any. Hence wait till
+	 * buffer resizing finishes, i.e. go into hibernation mode.
+	 *
+	 * TODO: We may not need this synchronization if the background writer
+	 * itself becomes the coordinator.
+	 */
+	if (!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress))
+		return true;
+
+	/*
+	 * Resizing shared buffers while this function is performing an LRU scan on
+	 * them may lead to wrong results. Indicate that the resizing should wait for
+	 * the LRU scan to complete.
+	 */
+	delay_shmem_resize = true;
+
 	/*
 	 * Find out where the clock-sweep currently is, and how many buffer
 	 * allocations have happened since our last call.
@@ -3679,6 +3711,7 @@ BgBufferSync(WritebackContext *wb_context)
 	if (bgwriter_lru_maxpages <= 0)
 	{
 		saved_info_valid = false;
+		delay_shmem_resize = false;
 		return true;
 	}
 
@@ -3838,8 +3871,17 @@ BgBufferSync(WritebackContext *wb_context)
 	num_written = 0;
 	reusable_buffers = reusable_buffers_est;
 
-	/* Execute the LRU scan */
-	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
+	/*
+	 * Execute the LRU scan.
+	 *
+	 * If the buffer pool is being shrunk, the buffer being written may not
+	 * remain valid. If the buffer pool is being expanded, more buffers will
+	 * become available without this function writing any. Hence stop what we
+	 * are doing, which also unblocks other processes waiting for the buffer
+	 * resizing to finish.
+	 */
+	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est &&
+			!pg_atomic_unlocked_test_flag(&ShmemCtrl->resize_in_progress))
 	{
 		int			sync_state = SyncOneBuffer(next_to_clean, true,
 											   wb_context);
@@ -3898,6 +3940,9 @@ BgBufferSync(WritebackContext *wb_context)
 #endif
 	}
 
+	/* Let the resizing commence. */
+	delay_shmem_resize = false;
+
 	/* Return true if OK to hibernate */
 	return (bufs_to_lap == 0 && recent_alloc == 0);
 }
@@ -4208,7 +4253,23 @@ DebugPrintBufferRefcount(Buffer buffer)
 void
 CheckPointBuffers(int flags)
 {
+	/*
+	 * Mark that buffer sync is in progress - delay any shared memory resizing.
+	 *
+	 * TODO: We need to assess whether we should allow checkpoint and buffer
+	 * resizing to run in parallel. When expanding buffers it may be fine to let
+	 * the checkpointer run in the RESIZE_MAP_AND_MEM phase but delay the EXPAND
+	 * phase till the checkpoint finishes, while not allowing a checkpoint to
+	 * start during the expansion phase. When shrinking the buffers, we should
+	 * delay the SHRINK phase till the checkpoint finishes and not allow a
+	 * checkpoint to start till the SHRINK phase is done, but allow it to run in
+	 * the RESIZE_MAP_AND_MEM phase. This needs careful analysis and testing.
+	 */
+	delay_shmem_resize = true;
+	
 	BufferSync(flags);
+
+	/* Mark that buffer sync is no longer in progress - allow shared memory resizing */
+	delay_shmem_resize = false;
 }
 
 /*
@@ -7466,3 +7527,70 @@ const PgAioHandleCallbacks aio_local_buffer_readv_cb = {
 	.complete_local = local_buffer_readv_complete,
 	.report = buffer_readv_report,
 };
+
+/*
+ * When shrinking the shared buffer pool, evict the buffers which will not be part
+ * of the shrunk buffer pool.
+ */
+bool
+EvictExtraBuffers(int targetNBuffers, int currentNBuffers)
+{
+	bool result = true;
+
+	Assert(targetNBuffers < currentNBuffers);
+
+	/*
+	 * If the buffer being evicted is locked, this function will need to wait.
+	 * It should not be called from the postmaster, since the postmaster cannot
+	 * wait on a lock.
+	 */
+	Assert(IsUnderPostmaster);
+
+	/*
+	 * TODO: Before evicting any buffer, we should check whether any of the
+	 * buffers are pinned. If we find that a buffer is pinned after evicting
+	 * most of them, that will impact performance since all those evicted
+	 * buffers might need to be read again.
+	 */
+	for (Buffer buf = targetNBuffers + 1; buf <= currentNBuffers; buf++)
+	{
+		BufferDesc *desc = GetBufferDescriptor(buf - 1);
+		uint32		buf_state;
+		bool		buffer_flushed;
+
+		buf_state = pg_atomic_read_u32(&desc->state);
+
+		/*
+		 * Nobody is expected to touch these buffers while resizing is
+		 * going on, hence an unlocked precheck should be safe and saves
+		 * some cycles.
+		 */
+		if (!(buf_state & BM_VALID))
+			continue;
+
+		/*
+		 * XXX: Looks like CurrentResourceOwner can be NULL here, find
+		 * another one in that case?
+		 */
+		if (CurrentResourceOwner)
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+
+		ReservePrivateRefCountEntry();
+
+		LockBufHdr(desc);
+
+		/*
+		 * Now that we have locked buffer descriptor, make sure that the
+		 * buffer without valid data has been skipped above.
+		 */
+		Assert(buf_state & BM_VALID);
+
+		if (!EvictUnpinnedBufferInternal(desc, &buffer_flushed))
+		{
+			elog(WARNING, "could not remove buffer %u, it is pinned", buf);
+			result = false;
+			break;
+		}
+	}
+
+	return result;
+}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 13ee840ab9f..256521d889a 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -33,10 +33,16 @@ typedef struct
 	/* Spinlock: protects the values below */
 	slock_t		buffer_strategy_lock;
 
+	/*
+	 * Number of active buffers that can be allocated. During buffer resizing,
+	 * this may differ from NBuffers, which tracks the global buffer count.
+	 */
+	pg_atomic_uint32 activeNBuffers;
+
 	/*
 	 * clock-sweep hand: index of next buffer to consider grabbing. Note that
 	 * this isn't a concrete buffer - we only ever increase the value. So, to
-	 * get an actual buffer, it needs to be used modulo NBuffers.
+	 * get an actual buffer, it needs to be used modulo activeNBuffers.
 	 */
 	pg_atomic_uint32 nextVictimBuffer;
 
@@ -101,21 +107,27 @@ static inline uint32
 ClockSweepTick(void)
 {
 	uint32		victim;
+	int			activeBuffers;
 
 	/*
-	 * Atomically move hand ahead one buffer - if there's several processes
-	 * doing this, this can lead to buffers being returned slightly out of
-	 * apparent order.
+	 * Atomically move hand ahead one buffer - if there's several processes doing
+	 * this, this can lead to buffers being returned slightly out of apparent
+	 * order. We need to read both the current position of the hand and the
+	 * current buffer allocation limit together consistently, since they may be
+	 * reset by a concurrent resize.
 	 */
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	victim =
 		pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
+	activeBuffers = pg_atomic_read_u32(&StrategyControl->activeNBuffers);
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 
-	if (victim >= NBuffers)
+	if (victim >= activeBuffers)
 	{
 		uint32		originalVictim = victim;
 
 		/* always wrap what we look up in BufferDescriptors */
-		victim = victim % NBuffers;
+		victim = victim % activeBuffers;
 
 		/*
 		 * If we're the one that just caused a wraparound, force
@@ -143,7 +155,7 @@ ClockSweepTick(void)
 				 */
 				SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 
-				wrapped = expected % NBuffers;
+				wrapped = expected % activeBuffers;
 
 				success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
 														 &expected, wrapped);
@@ -228,7 +240,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
 
 	/* Use the "clock sweep" algorithm to find a free buffer */
-	trycounter = NBuffers;
+	trycounter = pg_atomic_read_u32(&StrategyControl->activeNBuffers);
 	for (;;)
 	{
 		uint32		old_buf_state;
@@ -281,7 +294,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 				if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
 												   local_buf_state))
 				{
-					trycounter = NBuffers;
+					trycounter = pg_atomic_read_u32(&StrategyControl->activeNBuffers);
 					break;
 				}
 			}
@@ -323,10 +336,12 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	uint32		nextVictimBuffer;
 	int			result;
+	uint32		activeNBuffers;
 
 	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
-	result = nextVictimBuffer % NBuffers;
+	activeNBuffers = pg_atomic_read_u32(&StrategyControl->activeNBuffers);
+	result = nextVictimBuffer % activeNBuffers;
 
 	if (complete_passes)
 	{
@@ -336,7 +351,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 		 * Additionally add the number of wraparounds that happened before
 		 * completePasses could be incremented. C.f. ClockSweepTick().
 		 */
-		*complete_passes += nextVictimBuffer / NBuffers;
+		*complete_passes += nextVictimBuffer / activeNBuffers;
 	}
 
 	if (num_buf_alloc)
@@ -383,7 +398,7 @@ StrategyShmemSize(void)
 	Size		size = 0;
 
 	/* size of lookup hash table ... see comment in StrategyInitialize */
-	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
+	size = add_size(size, BufTableShmemSize(MaxNBuffers + NUM_BUFFER_PARTITIONS));
 
 	/* size of the shared replacement strategy control block */
 	size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
@@ -391,6 +406,31 @@ StrategyShmemSize(void)
 	return size;
 }
 
+void
+StrategyReset(int activeNBuffers)
+{
+	Assert(StrategyControl);
+
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+	/* Update the active buffer count for the strategy */
+	pg_atomic_write_u32(&StrategyControl->activeNBuffers, activeNBuffers);
+
+	/* Reset the clock-sweep pointer to start from beginning */
+	pg_atomic_write_u32(&StrategyControl->nextVictimBuffer, 0);
+
+	/*
+	 * The statistics are interpreted relative to the number of shared buffers,
+	 * so reset them when the active number of shared buffers changes.
+	 */
+	StrategyControl->completePasses = 0;
+	pg_atomic_write_u32(&StrategyControl->numBufferAllocs, 0);
+
+	/* TODO: Do we need to reset background writer notifications? */
+	StrategyControl->bgwprocno = -1;
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+}
+
 /*
  * StrategyInitialize -- initialize the buffer cache replacement
  *		strategy.
@@ -408,12 +448,21 @@ StrategyInitialize(bool init)
 	 *
 	 * Since we can't tolerate running out of lookup table entries, we must be
 	 * sure to specify an adequate table size here.  The maximum steady-state
-	 * usage is of course NBuffers entries, but BufferAlloc() tries to insert
-	 * a new entry before deleting the old.  In principle this could be
-	 * happening in each partition concurrently, so we could need as many as
-	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
+	 * usage is of course as many entries as there are buffers in the buffer
+	 * pool.  Right now there is no way to free shared memory: even if we shrank
+	 * the buffer lookup table when shrinking the buffer pool, the unused hash
+	 * table entries could not be freed. When we expand the buffer pool, more
+	 * entries can be allocated, but we cannot resize the hash table directory
+	 * without rehashing all the entries, and just allocating more entries would
+	 * lead to more contention. Hence we set up the buffer lookup table for the
+	 * maximum possible size of the buffer pool, which is MaxNBuffers.
+	 *
+	 * Additionally BufferAlloc() tries to insert a new entry before deleting the
+	 * old.  In principle this could be happening in each partition concurrently,
+	 * so we need extra NUM_BUFFER_PARTITIONS entries.
 	 */
-	InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
+	InitBufTable(MaxNBuffers + NUM_BUFFER_PARTITIONS);
 
 	/*
 	 * Get or create the shared strategy control block
@@ -421,7 +470,7 @@ StrategyInitialize(bool init)
 	StrategyControl = (BufferStrategyControl *)
 		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found, STRATEGY_SHMEM_SEGMENT);
+						&found, MAIN_SHMEM_SEGMENT);
 
 	if (!found)
 	{
@@ -432,6 +481,8 @@ StrategyInitialize(bool init)
 
 		SpinLockInit(&StrategyControl->buffer_strategy_lock);
 
+		/* Initialize the active buffer count */
+		pg_atomic_init_u32(&StrategyControl->activeNBuffers, NBuffersPending);
 		/* Initialize the clock-sweep pointer */
 		pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
 
@@ -669,12 +720,23 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 		strategy->current = 0;
 
 	/*
-	 * If the slot hasn't been filled yet, tell the caller to allocate a new
-	 * buffer with the normal allocation strategy.  He will then fill this
-	 * slot by calling AddBufferToRing with the new buffer.
+	 * If the slot hasn't been filled yet, or the buffer in the slot has been
+	 * invalidated when the buffer pool was shrunk, tell the caller to allocate
+	 * a new buffer with the normal allocation strategy.  He will then fill this
+	 * slot by calling AddBufferToRing with the new buffer.
+	 *
+	 * TODO: Ideally we would want to check for bufnum > NBuffers only once
+	 * after every time the buffer pool is shrunk, so as to catch any runtime
+	 * bugs that introduce invalid buffers in the ring. But that is complicated.
+	 * The BufferAccessStrategy objects are not accessible outside the
+	 * ScanState, hence we cannot purge the ring entries while evicting the
+	 * buffers. After the resizing is finished, it's not possible to notice
+	 * when we touch the first and the last of those objects. See if this can
+	 * be fixed.
 	 */
 	bufnum = strategy->buffers[strategy->current];
-	if (bufnum == InvalidBuffer)
+	if (bufnum == InvalidBuffer ||
+		bufnum > pg_atomic_read_u32(&StrategyControl->activeNBuffers))
 		return NULL;
 
 	buf = GetBufferDescriptor(bufnum - 1);
diff --git a/src/backend/storage/buffer/meson.build b/src/backend/storage/buffer/meson.build
index 448976d2400..2fc58db5a91 100644
--- a/src/backend/storage/buffer/meson.build
+++ b/src/backend/storage/buffer/meson.build
@@ -6,4 +6,5 @@ backend_sources += files(
   'bufmgr.c',
   'freelist.c',
   'localbuf.c',
+  'buf_resize.c',
 )
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 41190f96639..23e9b53ea07 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -154,6 +154,14 @@ CalculateShmemSize(MemoryMappingSizes *mapping_sizes)
 	size = add_size(size, AioShmemSize());
 	size = add_size(size, WaitLSNShmemSize());
 
+	/*
+	 * XXX: For some reason slightly more memory is needed for larger
+	 * shared_buffers, but this size is enough for any large value I've tested
+	 * with. Is it a mistake in how slots are split, or was there a hidden
+	 * inconsistency in the shmem size calculation?
+	 */
+	size = add_size(size, 1024 * 1024 * 100);
+
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
@@ -168,8 +176,7 @@ CalculateShmemSize(MemoryMappingSizes *mapping_sizes)
 	/* might as well round it off to a multiple of a typical page size */
 	for (int segment = 0; segment < NUM_MEMORY_MAPPINGS; segment++)
 	{
-		mapping_sizes[segment].shmem_req_size = add_size(mapping_sizes[segment].shmem_req_size, 8192 - (mapping_sizes[segment].shmem_req_size % 8192));
-		mapping_sizes[segment].shmem_reserved = add_size(mapping_sizes[segment].shmem_reserved, 8192 - (mapping_sizes[segment].shmem_reserved % 8192));
+		round_off_mapping_sizes(&mapping_sizes[segment]);
 		/* Compute the total size of all segments */
 		size = size + mapping_sizes[segment].shmem_req_size;
 	}
@@ -313,6 +320,8 @@ CreateOrAttachShmemStructs(void)
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
+	/* TODO: This should be part of BufferManagerShmemInit() */
+	ShmemControlInit();
 	BufferManagerShmemInit();
 
 	/*
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 087821311cc..c7c36f2be67 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,9 +24,11 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalworker.h"
 #include "replication/walsender.h"
+#include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/pg_shmem.h"
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
@@ -109,6 +111,10 @@ static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
 
+#ifdef DEBUG_SHMEM_RESIZE
+bool delay_proc_signal_init = false;
+#endif
+
 /*
  * ProcSignalShmemSize
  *		Compute space needed for ProcSignal's shared memory
@@ -170,6 +176,43 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	uint32		old_pss_pid;
 
 	Assert(cancel_key_len >= 0 && cancel_key_len <= MAX_CANCEL_KEY_LENGTH);
+
+#ifdef DEBUG_SHMEM_RESIZE
+	/*
+	 * Introduced for debugging purposes. You can change the variable at
+	 * runtime using gdb, then start new backends with delayed ProcSignal
+	 * initialization. A simple pg_usleep won't work here because the SIGHUP
+	 * interrupt is needed for testing; the loop below is taken from pg_sleep().
+	 */
+	if (delay_proc_signal_init)
+	{
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+		float8		endtime = GetNowFloat() + 5;
+
+		for (;;)
+		{
+			float8		delay;
+			long		delay_ms;
+
+			CHECK_FOR_INTERRUPTS();
+
+			delay = endtime - GetNowFloat();
+			if (delay >= 600.0)
+				delay_ms = 600000;
+			else if (delay > 0.0)
+				delay_ms = (long) (delay * 1000.0);
+			else
+				break;
+
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 delay_ms,
+							 WAIT_EVENT_PG_SLEEP);
+			ResetLatch(MyLatch);
+		}
+	}
+#endif
+
 	if (MyProcNumber < 0)
 		elog(ERROR, "MyProcNumber not set");
 	if (MyProcNumber >= NumProcSignalSlots)
@@ -576,6 +619,18 @@ ProcessProcSignalBarrier(void)
 					case PROCSIGNAL_BARRIER_SMGRRELEASE:
 						processed = ProcessBarrierSmgrRelease();
 						break;
+					case PROCSIGNAL_BARRIER_SHBUF_SHRINK:
+						processed = ProcessBarrierShmemShrink();
+						break;
+					case PROCSIGNAL_BARRIER_SHBUF_RESIZE_MAP_AND_MEM:
+						processed = ProcessBarrierShmemResizeMapAndMem();
+						break;
+					case PROCSIGNAL_BARRIER_SHBUF_EXPAND:
+						processed = ProcessBarrierShmemExpand();
+						break;
+					case PROCSIGNAL_BARRIER_SHBUF_RESIZE_FAILED:
+						processed = ProcessBarrierShmemResizeFailed();
+						break;
 				}
 
 				/*
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index f303a9328df..eafcb665ba9 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -69,11 +69,19 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "port/pg_numa.h"
+#include "postmaster/bgwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/injection_point.h"
+#include "utils/wait_event.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
@@ -493,8 +501,7 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	{
 		/*
 		 * Structure is in the shmem index so someone else has allocated it
-		 * already.  The size better be the same as the size we are trying to
-		 * initialize to, or there is a name conflict (or worse).
+		 * already. The size better be the same as the size we are trying to
+		 * initialize to, or there is a name conflict (or worse).
 		 */
 		if (result->size != size)
 		{
@@ -504,6 +511,7 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 							" \"%s\": expected %zu, actual %zu",
 							name, size, result->size)));
 		}
+
 		structPtr = result->location;
 	}
 	else
@@ -538,6 +546,59 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
 	return structPtr;
 }
 
+/*
+ * ShmemUpdateStructInSegment -- Update the size of a structure in shared memory.
+ *
+ * This function updates the size of an existing shared memory structure. It
+ * finds the structure in the shmem index and updates its size information while
+ * preserving the existing memory location.
+ *
+ * Returns: pointer to the existing structure location.
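+ *
+ * In this patch it is used by BufferManagerShmemResize() after the underlying
+ * segment has been remapped; the structure is expected to stay at the same
+ * address.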
+ */
+void *
+ShmemUpdateStructInSegment(const char *name, Size size, bool *foundPtr,
+						   int shmem_segment)
+{
+	ShmemIndexEnt *result;
+	void	   *structPtr;
+	Size delta;
+
+	LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);
+
+	Assert(ShmemIndex);
+
+	/* Look up the structure in the shmem index */
+	result = (ShmemIndexEnt *)
+		hash_search(ShmemIndex, name, HASH_FIND, foundPtr);
+
+	Assert(*foundPtr);
+	Assert(result);
+	Assert(result->shmem_segment == shmem_segment);
+
+	delta = size - result->size;
+	/* Store the existing structure pointer */
+	structPtr = result->location;
+
+	/*
+	 * Update the size information.
+	 *
+	 * TODO: Ideally we should implement repalloc-like functionality for shared
+	 * memory which would return the allocated size.
+	 */
+	result->size = size;
+	result->allocated_size = size;
+
+	/* Reflect size change in the shared segment */
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
+	Segments[shmem_segment].ShmemSegHdr->freeoffset += delta;
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
+	LWLockRelease(ShmemIndexLock);
+
+	/* Verify the structure is still in the correct segment */
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
+	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
+
+	return structPtr;
+}
+
+
 /*
  * Add two Size values, checking for overflow
  */
@@ -871,4 +932,3 @@ pg_get_shmem_segments(PG_FUNCTION_ARGS)
 
 	return (Datum) 0;
 }
-
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2bd89102686..00c8afb9fe9 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -63,6 +63,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -4138,6 +4139,9 @@ PostgresSingleUserMain(int argc, char *argv[],
 	/* Initialize size of fast-path lock cache. */
 	InitializeFastPathLocks();
 
+	/* Initialize MaxNBuffers for buffer pool resizing. */
+	InitializeMaxNBuffers();
+
 	/*
 	 * Give preloaded libraries a chance to request additional shared memory.
 	 */
@@ -4328,6 +4332,13 @@ PostgresMain(const char *dbname, const char *username)
 	 */
 	BeginReportingGUCOptions();
 
+	/*
+	 * TODO: The new backend should fetch the shared buffers status. If the
+	 * resizing is going on, it should bring itself up to speed with it. If not,
+	 * simply fetch the latest pointers and sizes. Is this the right place to do
+	 * that?
+	 */
+
 	/*
 	 * Also set up handler to log session end; we have to wait till now to be
 	 * sure Log_disconnections has its final value.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..ee5887496ba 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -162,6 +162,7 @@ WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
 WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for streaming replication."
 WAL_SUMMARY_READY	"Waiting for a new WAL summary to be generated."
 XACT_GROUP_UPDATE	"Waiting for the group leader to update transaction status at transaction end."
+PM_BUFFER_RESIZE_WAIT	"Waiting for the postmaster to complete shared buffer pool resize operations."
 
 ABI_compatibility:
 
@@ -358,6 +359,7 @@ InjectionPoint	"Waiting to read or update information related to injection point
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
 WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
+ShmemResize	"Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..419c7fad890 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -139,7 +139,10 @@ int			max_parallel_maintenance_workers = 2;
  * MaxBackends is computed by PostmasterMain after modules have had a chance to
  * register background workers.
  */
-int			NBuffers = 16384;
+int			NBuffers = 0;
+int			NBuffersPending = 16384;
+bool        finalMaxNBuffers = false;
+int			MaxNBuffers = 0;
 int			MaxConnections = 100;
 int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 98f9598cd78..46a8a8a3faa 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -595,6 +595,55 @@ InitializeFastPathLocks(void)
 		   pg_nextpower2_32(FastPathLockGroupsPerBackend));
 }
 
+/*
+ * Initialize MaxNBuffers variable with validation.
+ *
+ * This must be called after GUCs have been loaded but before shared memory size
+ * is determined.
+ *
+ * Since MaxNBuffers limits the size of the buffer pool, it must be at least as
+ * large as NBuffersPending. If MaxNBuffers is 0 (the default), set it to
+ * NBuffersPending. Otherwise, validate that MaxNBuffers is not less than
+ * NBuffersPending.
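+ *
+ * For example (a sketch of the intended semantics): with shared_buffers set
+ * to '128MB' and max_shared_buffers left unset, max_shared_buffers defaults
+ * to '128MB'; with max_shared_buffers = '64MB' the postmaster refuses to
+ * start.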
+ */
+void
+InitializeMaxNBuffers(void)
+{
+	if (MaxNBuffers == 0)  /* default/boot value */
+	{
+		char		buf[32];
+
+		snprintf(buf, sizeof(buf), "%d", NBuffersPending);
+		SetConfigOption("max_shared_buffers", buf, PGC_POSTMASTER,
+						PGC_S_DYNAMIC_DEFAULT);
+
+		/*
+		 * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
+		 * However, if the DBA explicitly set max_shared_buffers = 0 in
+		 * the config file, then PGC_S_DYNAMIC_DEFAULT will fail to override
+		 * that and we must force the matter with PGC_S_OVERRIDE.
+		 */
+		if (MaxNBuffers == 0)	/* failed to apply it? */
+			SetConfigOption("max_shared_buffers", buf, PGC_POSTMASTER,
+							PGC_S_OVERRIDE);
+	}
+	else
+	{
+		if (MaxNBuffers < NBuffersPending)
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("max_shared_buffers (%d) cannot be less than current shared_buffers (%d)",
+							MaxNBuffers, NBuffersPending),
+					 errhint("Increase max_shared_buffers or decrease shared_buffers.")));
+		}
+	}
+	
+	Assert(MaxNBuffers > 0);
+	Assert(!finalMaxNBuffers);
+	finalMaxNBuffers = true;
+}
+
 /*
  * Early initialization of a backend (either standalone or under postmaster).
  * This happens even before InitPostgres.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7e2b17cc04e..c26a02e4a42 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2599,7 +2599,7 @@ convert_to_base_unit(double value, const char *unit,
  * the value without loss.  For example, if the base unit is GUC_UNIT_KB, 1024
  * is converted to 1 MB, but 1025 is represented as 1025 kB.
  */
-static void
+void
 convert_int_from_base_unit(int64 base_value, int base_unit,
 						   int64 *value, const char **unit)
 {
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 1128167c025..539b29f0065 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -2013,6 +2013,15 @@
   max => 'MAX_BACKENDS /* XXX? */',
 },
 
+{ name => "max_shared_buffers", type => 'int', context => 'PGC_POSTMASTER', group => 'RESOURCES_MEM',
+  short_desc => 'Sets the upper limit for the shared_buffers value.',
+  flags => 'GUC_UNIT_BLOCKS',
+  variable => 'MaxNBuffers',
+  boot_val => '0',
+  min => '0',
+  max => 'INT_MAX / 2',
+},
+
 { name => 'max_slot_wal_keep_size', type => 'int', context => 'PGC_SIGHUP', group => 'REPLICATION_SENDING',
   short_desc => 'Sets the maximum WAL size that can be reserved by replication slots.',
   long_desc => 'Replication slots will be marked as failed, and segments released for deletion or recycling, if this much space is occupied by WAL on disk. -1 means no maximum.',
@@ -2581,13 +2590,15 @@
 
 # We sometimes multiply the number of shared buffers by two without
 # checking for overflow, so we mustn't allow more than INT_MAX / 2.
-{ name => 'shared_buffers', type => 'int', context => 'PGC_POSTMASTER', group => 'RESOURCES_MEM',
+{ name => 'shared_buffers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_MEM',
   short_desc => 'Sets the number of shared memory buffers used by the server.',
   flags => 'GUC_UNIT_BLOCKS',
-  variable => 'NBuffers',
+  variable => 'NBuffersPending',
   boot_val => '16384',
   min => '16',
   max => 'INT_MAX / 2',
+  check_hook => 'check_shared_buffers',
+  show_hook => 'show_shared_buffers',
 },
 
 { name => 'shared_memory_size', type => 'int', context => 'PGC_INTERNAL', group => 'PRESET_OPTIONS',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 411043ca750..119d9dd5880 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12612,4 +12612,10 @@
   proargnames => '{pid,io_id,io_generation,state,operation,off,length,target,handle_data_len,raw_result,result,target_desc,f_sync,f_localmem,f_buffered}',
   prosrc => 'pg_get_aios' },
 
+{ oid => '9999', descr => 'resize shared buffers according to the value of GUC `shared_buffers`',
+  proname => 'pg_resize_shared_buffers',
+  provolatile => 'v',
+  prorettype => 'bool',
+  proargtypes => '',
+  prosrc => 'pg_resize_shared_buffers'},
 ]
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 9a7d733ddef..b4dc2c4ba57 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -173,7 +173,11 @@ extern PGDLLIMPORT bool ExitOnAnyError;
 extern PGDLLIMPORT char *DataDir;
 extern PGDLLIMPORT int data_directory_mode;
 
+/* TODO: This is no longer a GUC variable; it should be moved somewhere else. */
 extern PGDLLIMPORT int NBuffers;
+extern PGDLLIMPORT int NBuffersPending;
+extern PGDLLIMPORT bool finalMaxNBuffers;
+extern PGDLLIMPORT int MaxNBuffers;
 extern PGDLLIMPORT int MaxBackends;
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
@@ -502,6 +506,7 @@ extern PGDLLIMPORT ProcessingMode Mode;
 extern void pg_split_opts(char **argv, int *argcp, const char *optstr);
 extern void InitializeMaxBackends(void);
 extern void InitializeFastPathLocks(void);
+extern void InitializeMaxNBuffers(void);
 extern void InitPostgres(const char *in_dbname, Oid dboid,
 						 const char *username, Oid useroid,
 						 bits32 flags,
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 519692702a0..4c53194e13e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -513,6 +513,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
+extern void StrategyReset(int activeNBuffers);
 
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3769f4db7dc..774cf8f38ed 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -21,6 +21,7 @@
 #include "storage/bufpage.h"
 #include "storage/pg_shmem.h"
 #include "storage/relfilelocator.h"
+#include "utils/guc.h"
 #include "utils/relcache.h"
 #include "utils/snapmgr.h"
 
@@ -159,6 +160,7 @@ typedef struct WritebackContext WritebackContext;
 
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
+extern PGDLLIMPORT int NBuffersPending;
 
 /* in bufmgr.c */
 extern PGDLLIMPORT bool zero_damaged_pages;
@@ -205,6 +207,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * prototypes for functions in buf_init.c
+ */
+extern const char *show_shared_buffers(void);
+extern bool check_shared_buffers(int *newval, void **extra, GucSource source);
 
 /*
  * prototypes for functions in bufmgr.c
@@ -308,6 +315,7 @@ extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(WritebackContext *wb_context);
+extern void BgBufferSyncReset(int currentNBuffers, int targetNBuffers);
 
 extern uint32 GetPinLimit(void);
 extern uint32 GetLocalPinLimit(void);
@@ -324,10 +332,13 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 									int32 *buffers_evicted,
 									int32 *buffers_flushed,
 									int32 *buffers_skipped);
+extern bool EvictExtraBuffers(int targetNBuffers, int currentNBuffers);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
 extern Size BufferManagerShmemSize(MemoryMappingSizes *mapping_sizes);
+extern void BufferManagerShmemResize(int currentNBuffers, int targetNBuffers);
+extern void BufferManagerShmemValidate(int targetNBuffers);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
@@ -376,7 +387,7 @@ extern void FreeAccessStrategy(BufferAccessStrategy strategy);
 static inline bool
 BufferIsValid(Buffer bufnum)
 {
-	Assert(bufnum <= NBuffers);
+	Assert(bufnum <= (Buffer) pg_atomic_read_u32(&ShmemCtrl->currentNBuffers));
 	Assert(bufnum >= -NLocBuffer);
 
 	return bufnum != InvalidBuffer;
@@ -430,4 +441,11 @@ BufferGetPage(Buffer buffer)
 
 #endif							/* FRONTEND */
 
+/* buf_resize.c */
+extern Datum pg_resize_shared_buffers(PG_FUNCTION_ARGS);
+extern bool ProcessBarrierShmemShrink(void);
+extern bool ProcessBarrierShmemResizeMapAndMem(void);
+extern bool ProcessBarrierShmemExpand(void);
+extern bool ProcessBarrierShmemResizeFailed(void);
+
 #endif							/* BUFMGR_H */
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index d73f1b407db..6dbbb9ad064 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -66,6 +66,7 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
+extern PGDLLIMPORT volatile bool delay_shmem_resize;
 
 pg_noreturn extern void proc_exit(int code);
 extern void shmem_exit(int code);
@@ -85,5 +86,7 @@ extern void CreateSharedMemoryAndSemaphores(void);
 extern void AttachSharedMemoryStructs(void);
 #endif
 extern void InitializeShmemGUCs(void);
+extern void CoordinateShmemResize(void);
+extern bool AnonymousShmemResize(void);
 
 #endif							/* IPC_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 5b0ce383408..9c4b928441c 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -86,6 +86,7 @@ PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
 PG_LWLOCK(54, WaitLSN)
+PG_LWLOCK(55, ShmemResize)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index beee0a53d2d..36900068820 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,8 +24,13 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "port/atomics.h"
+#include "storage/barrier.h"
 #include "storage/dsm_impl.h"
+#include "storage/procsignal.h"
 #include "storage/spin.h"
+#include "storage/shmem.h"
+#include "utils/guc.h"
 
 typedef struct MemoryMappingSizes
 {
@@ -65,15 +70,39 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define NUM_MEMORY_MAPPINGS 6
+#define NUM_MEMORY_MAPPINGS 5
 
 extern PGDLLIMPORT ShmemSegment Segments[NUM_MEMORY_MAPPINGS];
 
+/*
+ * ShmemControl is shared between backends and helps to coordinate shared
+ * memory resize.
+ * 
+ * TODO: I think we need a lock to protect this structure. If we do so, do we
+ * need to use atomic integers?
+ */
+typedef struct
+{
+	pg_atomic_flag		resize_in_progress; /* true if resizing is in progress. false otherwise. */
+	pg_atomic_uint32	currentNBuffers; /* Original NBuffers value before resize started */
+	pg_atomic_uint32	targetNBuffers;
+	pid_t				coordinator;
+} ShmemControl;
+
+extern PGDLLIMPORT ShmemControl *ShmemCtrl;
+
+/* The phases for shared memory resizing, used by the ProcSignal barrier. */
+#define SHMEM_RESIZE_REQUESTED			0
+#define SHMEM_RESIZE_START				1
+#define SHMEM_RESIZE_DONE				2
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
 extern PGDLLIMPORT int huge_pages_status;
+extern PGDLLIMPORT bool finalMaxNBuffers;
+extern PGDLLIMPORT int MaxNBuffers;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
@@ -113,6 +142,17 @@ extern void PGSharedMemoryReAttach(void);
 extern void PGSharedMemoryNoReAttach(void);
 #endif
 
+/*
+ * Round off mapping sizes to a multiple of a typical page size.
+ */
+static inline void
+round_off_mapping_sizes(MemoryMappingSizes *mapping_sizes)
+{
+	mapping_sizes->shmem_req_size = add_size(mapping_sizes->shmem_req_size, 8192 - (mapping_sizes->shmem_req_size % 8192));
+	mapping_sizes->shmem_reserved = add_size(mapping_sizes->shmem_reserved, 8192 - (mapping_sizes->shmem_reserved % 8192));
+}
+
+
 extern PGShmemHeader *PGSharedMemoryCreate(MemoryMappingSizes *mapping_sizes, int segment_id,
 										   PGShmemHeader **shim);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
@@ -122,6 +162,13 @@ extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 							int *memfd_flags);
 void PrepareHugePages(void);
 
+bool ProcessBarrierShmemResize(Barrier *barrier);
+const char *show_shared_buffers(void);
+bool check_shared_buffers(int *newval, void **extra, GucSource source);
+void AdjustShmemSize(void);
+extern void WaitOnShmemBarrier(void);
+extern void ShmemControlInit(void);
+
 /*
  * To be able to dynamically resize largest parts of the data stored in shared
  * memory, we split it into multiple shared memory mappings segments. Each
@@ -144,7 +191,4 @@ void PrepareHugePages(void);
 /* Checkpoint BufferIds */
 #define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
 
-/* Buffer strategy status */
-#define STRATEGY_SHMEM_SEGMENT 5
-
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 428aa3fd68a..5ced2a83537 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -42,9 +42,10 @@ typedef enum
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
 	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
+	PMSIGNAL_SHMEM_RESIZE,	/* resize shared memory */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
+#define NUM_PMSIGNALS (PMSIGNAL_SHMEM_RESIZE+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index afeeb1ca019..4de11faf12d 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,6 +54,10 @@ typedef enum
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
+	PROCSIGNAL_BARRIER_SHBUF_SHRINK, /* shrink buffer pool - restrict allocations to new size */
+	PROCSIGNAL_BARRIER_SHBUF_RESIZE_MAP_AND_MEM, /* remap shared memory segments and update structure pointers */
+	PROCSIGNAL_BARRIER_SHBUF_EXPAND, /* expand buffer pool - enable allocations in new range */
+	PROCSIGNAL_BARRIER_SHBUF_RESIZE_FAILED, /* signal backends that the shared buffer resizing failed. */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index c56712555f0..d59e5ba6dcd 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -49,11 +49,14 @@ extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
 extern void *ShmemInitStructInSegment(const char *name, Size size,
 									  bool *foundPtr, int shmem_segment);
+extern void *ShmemUpdateStructInSegment(const char *name, Size size,
+										bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 
 extern PGDLLIMPORT Size pg_get_shmem_pagesize(void);
 
+
 /* ipci.c */
 extern void RequestAddinShmemSpace(Size size);
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f21ec37da89..08a84373fb7 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -459,6 +459,8 @@ extern config_handle *get_config_handle(const char *name);
 extern void AlterSystemSetConfigFile(AlterSystemStmt *altersysstmt);
 extern char *GetConfigOptionByName(const char *name, const char **varname,
 								   bool missing_ok);
+extern void convert_int_from_base_unit(int64 base_value, int base_unit,
+									   int64 *value, const char **unit);
 
 extern void TransformGUCArray(ArrayType *array, List **names,
 							  List **values);
diff --git a/src/test/Makefile b/src/test/Makefile
index 511a72e6238..95f8858a818 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,7 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl postmaster regress isolation modules authentication recovery subscription
+SUBDIRS = perl postmaster regress isolation modules authentication recovery subscription buffermgr
 
 ifeq ($(with_icu),yes)
 SUBDIRS += icu
diff --git a/src/test/README b/src/test/README
index afdc7676519..77f11607ff7 100644
--- a/src/test/README
+++ b/src/test/README
@@ -15,6 +15,9 @@ examples/
   Demonstration programs for libpq that double as regression tests via
   "make check"
 
+buffermgr/
+  Tests for resizing the buffer pool without restarting the server
+
 isolation/
   Tests for concurrent behavior at the SQL level
 
diff --git a/src/test/buffermgr/Makefile b/src/test/buffermgr/Makefile
new file mode 100644
index 00000000000..eb275027fa6
--- /dev/null
+++ b/src/test/buffermgr/Makefile
@@ -0,0 +1,30 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/buffermgr
+#
+# Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/buffermgr/Makefile
+#
+#-------------------------------------------------------------------------
+
+EXTRA_INSTALL = contrib/pg_buffercache
+
+REGRESS = buffer_resize
+
+# Custom configuration for buffer manager tests
+TEMP_CONFIG = $(srcdir)/buffermgr_test.conf
+
+subdir = src/test/buffermgr
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean:
+	rm -rf tmp_check
diff --git a/src/test/buffermgr/README b/src/test/buffermgr/README
new file mode 100644
index 00000000000..c375ad80989
--- /dev/null
+++ b/src/test/buffermgr/README
@@ -0,0 +1,26 @@
+src/test/buffermgr/README
+
+Regression tests for buffer manager
+===================================
+
+This directory contains a test suite for resizing the buffer pool without restarting the server.
+
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
+
+See src/test/perl/README for more info about running these tests.
diff --git a/src/test/buffermgr/buffermgr_test.conf b/src/test/buffermgr/buffermgr_test.conf
new file mode 100644
index 00000000000..b7c0065c80b
--- /dev/null
+++ b/src/test/buffermgr/buffermgr_test.conf
@@ -0,0 +1,11 @@
+# Configuration for buffer manager regression tests
+
+# Even if max_shared_buffers is set multiple times, only the last one is used
+# as the limit on shared_buffers.
+max_shared_buffers = 128kB
+# Set initial shared_buffers as expected by test
+shared_buffers = 128MB
+# Set a larger value for max_shared_buffers to allow testing resize operations
+max_shared_buffers = 300MB
+# Turn huge pages off, since that affects the size of memory segments
+huge_pages = off
\ No newline at end of file
diff --git a/src/test/buffermgr/expected/buffer_resize.out b/src/test/buffermgr/expected/buffer_resize.out
new file mode 100644
index 00000000000..d5cb9d78437
--- /dev/null
+++ b/src/test/buffermgr/expected/buffer_resize.out
@@ -0,0 +1,329 @@
+-- Test buffer pool resizing and shared memory allocation tracking
+-- This test resizes the buffer pool multiple times and monitors
+-- shared memory allocations related to buffer management
+-- TODO: The test sets shared_buffers values in MBs. Instead it could use values
+-- in kBs so that the test runs on very small machines.
+-- Create a view for buffer-related shared memory allocations
+CREATE VIEW buffer_allocations AS
+SELECT name, segment, size, allocated_size 
+FROM pg_shmem_allocations 
+WHERE name IN ('Buffer Blocks', 'Buffer Descriptors', 'Buffer IO Condition Variables', 
+               'Checkpoint BufferIds')
+ORDER BY name;
+-- Note: We exclude the 'main' segment even if it contains the shared buffer
+-- lookup table because it contains other shared structures whose total sizes
+-- may vary as the code changes.
+CREATE VIEW buffer_segments AS
+SELECT name, size, mapping_size, mapping_reserved_size
+FROM pg_shmem_segments
+WHERE name <> 'main'
+ORDER BY name;
+-- Enable pg_buffercache for buffer count verification
+CREATE EXTENSION IF NOT EXISTS pg_buffercache;
+-- Test 1: Default shared_buffers 
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 128MB
+(1 row)
+
+SHOW max_shared_buffers;
+ max_shared_buffers 
+--------------------
+ 300MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 134221824 |      134221824
+ Buffer Descriptors            | descriptors |   1048576 |        1048576
+ Buffer IO Condition Variables | iocv        |    262144 |         262144
+ Checkpoint BufferIds          | checkpoint  |    327680 |         327680
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 134225920 |    134225920 |             314580992
+ checkpoint  |    335872 |       335872 |                770048
+ descriptors |   1056768 |      1056768 |               2465792
+ iocv        |    270336 |       270336 |                622592
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        16384
+(1 row)
+
+-- Calling pg_resize_shared_buffers() without changing shared_buffers should be a no-op.
+SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers 
+--------------------------
+ t
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 128MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 134221824 |      134221824
+ Buffer Descriptors            | descriptors |   1048576 |        1048576
+ Buffer IO Condition Variables | iocv        |    262144 |         262144
+ Checkpoint BufferIds          | checkpoint  |    327680 |         327680
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 134225920 |    134225920 |             314580992
+ checkpoint  |    335872 |       335872 |                770048
+ descriptors |   1056768 |      1056768 |               2465792
+ iocv        |    270336 |       270336 |                622592
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        16384
+(1 row)
+
+-- Test 2: Set to 64MB  
+ALTER SYSTEM SET shared_buffers = '64MB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+    shared_buffers     
+-----------------------
+ 128MB (pending: 64MB)
+(1 row)
+
+SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers 
+--------------------------
+ t
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 64MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size   | allocated_size 
+-------------------------------+-------------+----------+----------------
+ Buffer Blocks                 | buffers     | 67112960 |       67112960
+ Buffer Descriptors            | descriptors |   524288 |         524288
+ Buffer IO Condition Variables | iocv        |   131072 |         131072
+ Checkpoint BufferIds          | checkpoint  |   163840 |         163840
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size   | mapping_size | mapping_reserved_size 
+-------------+----------+--------------+-----------------------
+ buffers     | 67117056 |     67117056 |             314580992
+ checkpoint  |   172032 |       172032 |                770048
+ descriptors |   532480 |       532480 |               2465792
+ iocv        |   139264 |       139264 |                622592
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+         8192
+(1 row)
+
+-- Test 3: Set to 256MB
+ALTER SYSTEM SET shared_buffers = '256MB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+    shared_buffers     
+-----------------------
+ 64MB (pending: 256MB)
+(1 row)
+
+SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers 
+--------------------------
+ t
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 256MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 268439552 |      268439552
+ Buffer Descriptors            | descriptors |   2097152 |        2097152
+ Buffer IO Condition Variables | iocv        |    524288 |         524288
+ Checkpoint BufferIds          | checkpoint  |    655360 |         655360
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 268443648 |    268443648 |             314580992
+ checkpoint  |    663552 |       663552 |                770048
+ descriptors |   2105344 |      2105344 |               2465792
+ iocv        |    532480 |       532480 |                622592
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        32768
+(1 row)
+
+-- Test 4: Set to 100MB (non-power-of-two)
+ALTER SYSTEM SET shared_buffers = '100MB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+     shared_buffers     
+------------------------
+ 256MB (pending: 100MB)
+(1 row)
+
+SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers 
+--------------------------
+ t
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 100MB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |   size    | allocated_size 
+-------------------------------+-------------+-----------+----------------
+ Buffer Blocks                 | buffers     | 104861696 |      104861696
+ Buffer Descriptors            | descriptors |    819200 |         819200
+ Buffer IO Condition Variables | iocv        |    204800 |         204800
+ Checkpoint BufferIds          | checkpoint  |    256000 |         256000
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |   size    | mapping_size | mapping_reserved_size 
+-------------+-----------+--------------+-----------------------
+ buffers     | 104865792 |    104865792 |             314580992
+ checkpoint  |    262144 |       262144 |                770048
+ descriptors |    827392 |       827392 |               2465792
+ iocv        |    212992 |       212992 |                622592
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+        12800
+(1 row)
+
+-- Test 5: Set to minimum 128kB
+ALTER SYSTEM SET shared_buffers = '128kB';
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+     shared_buffers     
+------------------------
+ 100MB (pending: 128kB)
+(1 row)
+
+SELECT pg_resize_shared_buffers();
+ pg_resize_shared_buffers 
+--------------------------
+ t
+(1 row)
+
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 128kB
+(1 row)
+
+SELECT * FROM buffer_allocations;
+             name              |   segment   |  size  | allocated_size 
+-------------------------------+-------------+--------+----------------
+ Buffer Blocks                 | buffers     | 135168 |         135168
+ Buffer Descriptors            | descriptors |   1024 |           1024
+ Buffer IO Condition Variables | iocv        |    256 |            256
+ Checkpoint BufferIds          | checkpoint  |    320 |            320
+(4 rows)
+
+SELECT * FROM buffer_segments;
+    name     |  size  | mapping_size | mapping_reserved_size 
+-------------+--------+--------------+-----------------------
+ buffers     | 139264 |       139264 |             314580992
+ checkpoint  |   8192 |         8192 |                770048
+ descriptors |   8192 |         8192 |               2465792
+ iocv        |   8192 |         8192 |                622592
+(4 rows)
+
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+ buffer_count 
+--------------
+           16
+(1 row)
+
+-- Test 6: Try to set shared_buffers higher than max_shared_buffers (should fail)
+ALTER SYSTEM SET shared_buffers = '400MB';
+ERROR:  invalid value for parameter "shared_buffers": 51200
+DETAIL:  "shared_buffers" must be less than "max_shared_buffers".
+SELECT pg_reload_conf();
+ pg_reload_conf 
+----------------
+ t
+(1 row)
+
+-- reconnect to ensure new setting is loaded
+\c
+-- This should show the old value since the configuration was rejected
+SHOW shared_buffers;
+ shared_buffers 
+----------------
+ 128kB
+(1 row)
+
+SHOW max_shared_buffers;
+ max_shared_buffers 
+--------------------
+ 300MB
+(1 row)
+
diff --git a/src/test/buffermgr/meson.build b/src/test/buffermgr/meson.build
new file mode 100644
index 00000000000..c24bff721e6
--- /dev/null
+++ b/src/test/buffermgr/meson.build
@@ -0,0 +1,23 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'buffermgr',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'buffer_resize',
+    ],
+    'regress_args': ['--temp-config', files('buffermgr_test.conf')],
+  },
+  'tap': {
+    'env': {
+      'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
+    },
+    'tests': [
+      't/001_resize_buffer.pl',
+      't/003_parallel_resize_buffer.pl',
+      't/004_client_join_buffer_resize.pl',
+    ],
+  },
+}
diff --git a/src/test/buffermgr/sql/buffer_resize.sql b/src/test/buffermgr/sql/buffer_resize.sql
new file mode 100644
index 00000000000..dfaaeabfcbb
--- /dev/null
+++ b/src/test/buffermgr/sql/buffer_resize.sql
@@ -0,0 +1,95 @@
+-- Test buffer pool resizing and shared memory allocation tracking
+-- This test resizes the buffer pool multiple times and monitors
+-- shared memory allocations related to buffer management
+-- TODO: The test sets shared_buffers values in MBs. Instead it could use values
+-- in kBs so that the test runs on very small machines.
+
+-- Create a view for buffer-related shared memory allocations
+CREATE VIEW buffer_allocations AS
+SELECT name, segment, size, allocated_size 
+FROM pg_shmem_allocations 
+WHERE name IN ('Buffer Blocks', 'Buffer Descriptors', 'Buffer IO Condition Variables', 
+               'Checkpoint BufferIds')
+ORDER BY name;
+
+-- Note: We exclude the 'main' segment even if it contains the shared buffer
+-- lookup table because it contains other shared structures whose total sizes
+-- may vary as the code changes.
+CREATE VIEW buffer_segments AS
+SELECT name, size, mapping_size, mapping_reserved_size
+FROM pg_shmem_segments
+WHERE name <> 'main'
+ORDER BY name;
+
+-- Enable pg_buffercache for buffer count verification
+CREATE EXTENSION IF NOT EXISTS pg_buffercache;
+
+-- Test 1: Default shared_buffers 
+SHOW shared_buffers;
+SHOW max_shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+-- Calling pg_resize_shared_buffers() without changing shared_buffers should be a no-op.
+SELECT pg_resize_shared_buffers();
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 2: Set to 64MB  
+ALTER SYSTEM SET shared_buffers = '64MB';
+SELECT pg_reload_conf();
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+SELECT pg_resize_shared_buffers();
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 3: Set to 256MB
+ALTER SYSTEM SET shared_buffers = '256MB';
+SELECT pg_reload_conf();
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+SELECT pg_resize_shared_buffers();
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 4: Set to 100MB (non-power-of-two)
+ALTER SYSTEM SET shared_buffers = '100MB';
+SELECT pg_reload_conf();
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+SELECT pg_resize_shared_buffers();
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 5: Set to minimum 128kB
+ALTER SYSTEM SET shared_buffers = '128kB';
+SELECT pg_reload_conf();
+-- reconnect to ensure new setting is loaded
+\c
+SHOW shared_buffers;
+SELECT pg_resize_shared_buffers();
+SHOW shared_buffers;
+SELECT * FROM buffer_allocations;
+SELECT * FROM buffer_segments;
+SELECT COUNT(*) AS buffer_count FROM pg_buffercache;
+
+-- Test 6: Try to set shared_buffers higher than max_shared_buffers (should fail)
+ALTER SYSTEM SET shared_buffers = '400MB';
+SELECT pg_reload_conf();
+-- reconnect to ensure new setting is loaded
+\c
+-- This should show the old value since the configuration was rejected
+SHOW shared_buffers;
+SHOW max_shared_buffers;
diff --git a/src/test/buffermgr/t/001_resize_buffer.pl b/src/test/buffermgr/t/001_resize_buffer.pl
new file mode 100644
index 00000000000..a0d7f094171
--- /dev/null
+++ b/src/test/buffermgr/t/001_resize_buffer.pl
@@ -0,0 +1,135 @@
+# Copyright (c) 2025-2025, PostgreSQL Global Development Group
+#
+# Minimal test exercising shared_buffers resizing under load
+
+use strict;
+use warnings;
+use IPC::Run;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Function to resize buffer pool and verify the change.
+sub apply_and_verify_buffer_change
+{
+	my ($node, $new_size) = @_;
+	
+	# Use the new pg_resize_shared_buffers() interface which handles everything synchronously
+	$node->safe_psql('postgres', "ALTER SYSTEM SET shared_buffers = '$new_size'");
+	$node->safe_psql('postgres', "SELECT pg_reload_conf()");
+	
+	# If resize function fails, try a few times before giving up
+	my $max_retries = 5;
+	my $retry_delay = 1; # seconds
+	my $success = 0;
+	for my $attempt (1..$max_retries) {
+		my $result = $node->safe_psql('postgres', "SELECT pg_resize_shared_buffers()");
+		if ($result eq 't') {
+			$success = 1;
+			last;
+		}
+		
+		# If not the last attempt, wait before retrying
+		if ($attempt < $max_retries) {
+			note "Resizing buffer pool to $new_size, attempt $attempt failed, retrying after $retry_delay seconds...";
+			sleep($retry_delay);
+		}
+	}
+	
+	is($success, 1, 'resizing to ' . $new_size . ' succeeded after retries');
+	is($node->safe_psql('postgres', "SHOW shared_buffers"), $new_size,
+		'SHOW after resizing to '. $new_size . ' succeeded');
+}
+
+# Initialize a cluster and start pgbench in the background for concurrent load.
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+
+# Permit resizing up to 1GB for this test and let the server start with 128MB.
+$node->append_conf('postgresql.conf', qq{
+max_shared_buffers = 1GB
+shared_buffers = 128MB
+log_statement = none
+});
+
+$node->start;
+$node->safe_psql('postgres', "CREATE EXTENSION pg_buffercache");
+my $pgb_scale = 10;
+my $pgb_duration = 120;
+my $pgb_num_clients = 10;
+$node->pgbench(
+	"--initialize --init-steps=dtpvg --scale=$pgb_scale --quiet",
+	0,
+	[qr{^$}],
+	[   # stderr patterns to verify initialization stages
+		qr{dropping old tables},
+		qr{creating tables},
+		qr{done in \d+\.\d\d s }
+	],
+	"pgbench initialization (scale=$pgb_scale)"
+);
+my ($pgbench_stdin, $pgbench_stdout, $pgbench_stderr) = ('', '', '');
+# Use --exit-on-abort so that the test stops on the first server crash or error,
+# thus making it easy to debug the failure. Use -C to increase the chances of a
+# new backend being created while resizing the buffer pool.
+my $pgbench_process = IPC::Run::start(
+	[
+		'pgbench',
+		'-p', $node->port,
+		'-T', $pgb_duration,
+		'-c', $pgb_num_clients,
+		'-C',
+		'--exit-on-abort',
+		'postgres'
+	],
+	'<'  => \$pgbench_stdin,
+	'>'  => \$pgbench_stdout,
+	'2>' => \$pgbench_stderr
+);
+
+ok($pgbench_process, "pgbench started successfully");
+
+# Allow pgbench to establish connections and start generating load.
+# 
+# TODO: When creating new backends is known to work well with buffer pool
+# resizing, this wait should be removed.
+sleep(1);
+
+# Resize buffer pool to various sizes while pgbench is running in the
+# background.
+# 
+# TODO: These are pseudo-randomly picked sizes, but we can do better.
+my $tests_completed = 0;
+my @buffer_sizes = ('900MB', '500MB', '250MB', '400MB', '120MB', '600MB');
+for my $target_size (@buffer_sizes)
+{
+	# Verify workload generator is still running
+	if (!$pgbench_process->pumpable) {
+		ok(0, "pgbench is still running");
+		last;
+	}
+	
+	apply_and_verify_buffer_change($node, $target_size);
+	$tests_completed++;
+	
+	# Wait for the resized buffer pool to stabilize. If the resized buffer pool
+	# is utilized fully, it might hit any wrongly initialized areas of shared
+	# memory.
+	sleep(2);
+}
+is($tests_completed, scalar(@buffer_sizes), "All buffer sizes were tested");
+
+# Make sure that pgbench can end normally.
+$pgbench_process->signal('TERM');
+IPC::Run::finish $pgbench_process;
+ok(grep { $pgbench_process->result == $_ } (0, 15),  "pgbench exited gracefully");
+
+# Log any error output from pgbench for debugging
+diag("pgbench stderr:\n$pgbench_stderr");
+diag("pgbench stdout:\n$pgbench_stdout");
+
+# Ensure database is still functional after all the buffer changes
+$node->connect_ok("dbname=postgres", 
+	"Database remains accessible after $tests_completed buffer resize operations");
+
+done_testing();
\ No newline at end of file
diff --git a/src/test/buffermgr/t/003_parallel_resize_buffer.pl b/src/test/buffermgr/t/003_parallel_resize_buffer.pl
new file mode 100644
index 00000000000..9cbb5452fd2
--- /dev/null
+++ b/src/test/buffermgr/t/003_parallel_resize_buffer.pl
@@ -0,0 +1,71 @@
+# Copyright (c) 2025-2025, PostgreSQL Global Development Group
+#
+# Test that only one pg_resize_shared_buffers() call succeeds when multiple
+# sessions attempt to resize buffers concurrently
+
+use strict;
+use warnings;
+use IPC::Run;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Skip this test if injection points are not supported
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize a cluster
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('postgresql.conf', 'shared_preload_libraries = injection_points');
+$node->append_conf('postgresql.conf', 'shared_buffers = 128kB');
+$node->append_conf('postgresql.conf', 'max_shared_buffers = 256kB');
+$node->start;
+
+# Load injection points extension for test coordination
+$node->safe_psql('postgres', "CREATE EXTENSION injection_points");
+
+# Test 1: Two concurrent pg_resize_shared_buffers() calls
+# Set up injection point to pause the first resize call
+$node->safe_psql('postgres', 
+	"SELECT injection_points_attach('pg-resize-shared-buffers-flag-set', 'wait')");
+
+# Change shared_buffers for the resize operation
+$node->safe_psql('postgres', "ALTER SYSTEM SET shared_buffers = '144kB'");
+$node->safe_psql('postgres', "SELECT pg_reload_conf()");
+
+# Start first resize session (will pause at injection point)
+my $session1 = $node->background_psql('postgres');
+$session1->query_until(
+	qr/starting_resize/,
+	q(
+		\echo starting_resize
+		SELECT pg_resize_shared_buffers();
+	)
+);
+
+# Wait until session actually reaches the injection point
+$node->wait_for_event('client backend', 'pg-resize-shared-buffers-flag-set');
+
+# Start second resize session (should fail immediately since resize is in progress)
+my $result2 = $node->safe_psql('postgres', "SELECT pg_resize_shared_buffers()");
+
+# The second call should return false (already in progress)
+is($result2, 'f', 'Second concurrent resize call returns false');
+
+# Wake up the first session
+$node->safe_psql('postgres', 
+	"SELECT injection_points_wakeup('pg-resize-shared-buffers-flag-set')");
+
+# The pg_resize_shared_buffers() in session1 should now complete successfully
+# We can't easily capture the return value from query_until, but we can
+# verify the session completes without error and the resize actually happened
+$session1->quit;
+
+# Detach injection point
+$node->safe_psql('postgres', 
+	"SELECT injection_points_detach('pg-resize-shared-buffers-flag-set')");
+
+done_testing();
\ No newline at end of file
diff --git a/src/test/buffermgr/t/004_client_join_buffer_resize.pl b/src/test/buffermgr/t/004_client_join_buffer_resize.pl
new file mode 100644
index 00000000000..06f0de6b409
--- /dev/null
+++ b/src/test/buffermgr/t/004_client_join_buffer_resize.pl
@@ -0,0 +1,241 @@
+# Copyright (c) 2025-2025, PostgreSQL Global Development Group
+#
+# Test shared_buffer resizing coordination with client connections joining using injection points
+
+use strict;
+use warnings;
+use IPC::Run;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(sleep);
+
+# Skip this test if injection points are not supported
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+# Function to calculate the size of the test table required to fill up the
+# maximum buffer pool when populating it.
+sub calculate_test_sizes
+{
+	my ($node, $block_size) = @_;
+	
+	# Get the maximum buffer pool size from configuration
+	my $max_shared_buffers = $node->safe_psql('postgres', "SHOW max_shared_buffers");
+	my ($max_val, $max_unit) = ($max_shared_buffers =~ /(\d+)(\w+)/);
+	my $max_size_bytes;
+	if (lc($max_unit) eq 'kb') {
+		$max_size_bytes = $max_val * 1024;
+	} elsif (lc($max_unit) eq 'mb') {
+		$max_size_bytes = $max_val * 1024 * 1024;
+	} elsif (lc($max_unit) eq 'gb') {
+		$max_size_bytes = $max_val * 1024 * 1024 * 1024;
+	} else {
+		# Default to kB if unit is not recognized
+		$max_size_bytes = $max_val * 1024;
+	}
+	
+	# Fill more pages than minimally required to increase the chances of pages
+	# from the test table filling the buffer cache.
+
+	my $pages_needed = int($max_size_bytes / $block_size) + 10; # Add some extra to ensure buffers are filled
+	my $rows_to_insert = $pages_needed * 100; # Assuming roughly 100 rows per page for our table structure
+	
+	return ($max_size_bytes, $pages_needed, $rows_to_insert);
+}
+
+# Function to calculate expected buffer count from size string
+sub calculate_buffer_count
+{
+	my ($size_string, $block_size) = @_;
+	
+	# Parse size and convert to bytes
+	my ($size_val, $unit) = ($size_string =~ /(\d+)(\w+)/);
+	my $size_bytes;
+	if (lc($unit) eq 'kb') {
+		$size_bytes = $size_val * 1024;
+	} elsif (lc($unit) eq 'mb') {
+		$size_bytes = $size_val * 1024 * 1024;
+	} elsif (lc($unit) eq 'gb') {
+		$size_bytes = $size_val * 1024 * 1024 * 1024;
+	} else {
+		# Default to kB if unit is not recognized
+		$size_bytes = $size_val * 1024;
+	}
+	
+	return int($size_bytes / $block_size);
+}
+
+# Initialize cluster with very small buffer sizes for testing
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+
+# Configure for buffer resizing with very small buffer pool sizes for faster tests.
+# TODO: for some reason parallel workers try to use the default shared_buffers value, which doesn't work with a lower max_shared_buffers; somewhere the default is being picked up and needs to be fixed. For now, disable parallelism.
+$node->append_conf('postgresql.conf', 'shared_preload_libraries = injection_points');
+$node->append_conf('postgresql.conf', qq{
+max_shared_buffers = 512kB
+shared_buffers = 320kB
+max_parallel_workers_per_gather = 0
+});
+
+$node->start;
+
+# Enable injection points
+$node->safe_psql('postgres', "CREATE EXTENSION injection_points");
+
+# Get the block size (this is fixed for the binary)
+my $block_size = $node->safe_psql('postgres', "SHOW block_size");
+
+# Try to create pg_buffercache extension for buffer analysis
+eval { 
+	$node->safe_psql('postgres', "CREATE EXTENSION pg_buffercache");
+};
+if ($@) {
+	$node->stop;
+	plan skip_all => 'pg_buffercache extension not available - cannot verify buffer usage';
+}
+
+# Create a small test table, and fetch its properties for later reference if required.
+$node->safe_psql('postgres', qq{
+	CREATE TABLE client_test (c1 int, data char(50));
+});
+
+my $table_oid = $node->safe_psql('postgres', "SELECT oid FROM pg_class WHERE relname = 'client_test'");
+my $table_relfilenode = $node->safe_psql('postgres', "SELECT relfilenode FROM pg_class WHERE relname = 'client_test'");
+note("Test table client_test: OID = $table_oid, relfilenode = $table_relfilenode");
+my ($max_size_bytes, $pages_needed, $rows_to_insert) = calculate_test_sizes($node, $block_size);
+
+# Create dedicated sessions for injection point handling and test queries,
+# so that we don't create new backends for test operations after starting
+# resize operation. Only one backend, which tests new backend synchronization
+# with resizing operation, should start after resizing has commenced.
+my $injection_session = $node->background_psql('postgres');
+my $query_session = $node->background_psql('postgres');
+my $resize_session = $node->background_psql('postgres');
+	
+# Function to run a single injection point test
+sub run_injection_point_test
+{
+	my ($test_name, $injection_point, $target_size, $operation_type) = @_;
+	
+	note("Test with $test_name ($operation_type)");
+	
+	# Calculate test parameters before starting resize
+	my ($max_size_bytes, $pages_needed, $rows_to_insert) = calculate_test_sizes($node, $block_size);
+	
+	# Update buffer pool size and wait for it to reflect pending state 
+	$resize_session->query_safe("ALTER SYSTEM SET shared_buffers = '$target_size'");
+	$resize_session->query_safe("SELECT pg_reload_conf()");
+	my $pending_size_str = "pending: $target_size";
+	$resize_session->poll_query_until("SELECT substring(current_setting('shared_buffers'), '$pending_size_str')", $pending_size_str);
+
+	# Set up injection point in injection session
+	$injection_session->query_safe("SELECT injection_points_attach('$injection_point', 'wait')");
+	
+	# Trigger resize
+	$resize_session->query_until(
+		qr/starting_resize/,
+		q(
+			\echo starting_resize
+			SELECT pg_resize_shared_buffers();
+		)
+	);
+	
+	# Wait until resize actually reaches the injection point using the query session
+	$query_session->wait_for_event('client backend', $injection_point);
+	
+	# Start a client while resize is paused
+	my $client = $node->background_psql('postgres');
+	note("Background client backend PID: " . $client->query_safe("SELECT pg_backend_pid()"));
+	
+	# Wake up the injection point from injection session
+	$injection_session->query_safe("SELECT injection_points_wakeup('$injection_point')");
+	
+	# Test buffer functionality immediately after waking up injection point
+	# Insert data to test buffer pool functionality during/after resize
+	$client->query_safe("INSERT INTO client_test SELECT i, 'test_data_' || i FROM generate_series(1, $rows_to_insert) i");
+	# Verify the data was inserted correctly and can be read back
+	is($client->query_safe("SELECT COUNT(*) FROM client_test"), $rows_to_insert, "inserted $rows_to_insert during $test_name ($operation_type) successful");
+	
+	# Verify table size is reasonable (should be substantial for testing)
+	ok($query_session->query_safe("SELECT pg_total_relation_size('client_test')") >=  $max_size_bytes,"table size is large enough to overflow buffer pool in test $test_name ($operation_type)");
+	
+	# Wait for the resize operation to complete. There is no direct way to do so
+	# in background_psql. Hence fire a psql command and wait for it to finish
+	$resize_session->query(q(\echo 'done'));
+	
+	# Detach injection point from injection session
+	$injection_session->query_safe("SELECT injection_points_detach('$injection_point')");
+	
+	# Verify resize completed successfully
+	is($query_session->query_safe("SELECT current_setting('shared_buffers')"), $target_size,
+		"resize completed successfully to $target_size");
+	
+	# Check buffer pool size using pg_buffercache after resize completion
+	is($query_session->query_safe("SELECT COUNT(*) FROM pg_buffercache"), calculate_buffer_count($target_size, $block_size), "all buffers in the buffer pool used in $test_name ($operation_type)");
+	
+	# Wait for client to complete
+	ok($client->quit, "client succeeded during $test_name ($operation_type)");
+	
+	# Clean up for next test
+	$query_session->query_safe("DELETE FROM client_test");
+}
+
+# Test injection points during buffer resize with client connections
+my @common_injection_tests = (
+	{
+		name => 'flag setting phase',
+		injection_point => 'pg-resize-shared-buffers-flag-set',
+	},
+	{
+		name => 'memory remap phase',
+		injection_point => 'pgrsb-after-shmem-resize',
+	},
+	{
+		name => 'resize map barrier complete',
+		injection_point => 'pgrsb-resize-barrier-sent',
+	},
+);
+
+# Test common injection points for both shrinking and expanding
+foreach my $test (@common_injection_tests)
+{
+	# Test shrinking scenario
+	run_injection_point_test($test->{name}, $test->{injection_point}, '272kB', 'shrinking');
+
+	# Test expanding scenario
+	run_injection_point_test($test->{name}, $test->{injection_point}, '400kB', 'expanding');
+}
+
+my @shrink_only_tests = (
+	{
+		name => 'shrink barrier complete',
+		injection_point => 'pgrsb-shrink-barrier-sent',
+		size => '200kB',
+	}
+);
+foreach my $test (@shrink_only_tests)
+{
+	run_injection_point_test($test->{name}, $test->{injection_point}, $test->{size}, 'shrinking only');
+}
+
+my @expand_only_tests = (
+	{
+		name => 'expand barrier complete',
+		injection_point => 'pgrsb-expand-barrier-sent',
+		size => '416kB',
+	}
+);
+foreach my $test (@expand_only_tests)
+{
+	run_injection_point_test($test->{name}, $test->{injection_point}, $test->{size}, 'expanding only');
+}
+
+$injection_session->quit;
+$query_session->quit;
+$resize_session->quit;
+
+done_testing();
\ No newline at end of file
diff --git a/src/test/meson.build b/src/test/meson.build
index ccc31d6a86a..2a5ba1dec39 100644
--- a/src/test/meson.build
+++ b/src/test/meson.build
@@ -4,6 +4,7 @@ subdir('regress')
 subdir('isolation')
 
 subdir('authentication')
+subdir('buffermgr')
 subdir('postmaster')
 subdir('recovery')
 subdir('subscription')
diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
index 60bbd5dd445..16625e94d92 100644
--- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
+++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
@@ -61,6 +61,7 @@ use Config;
 use IPC::Run;
 use PostgreSQL::Test::Utils qw(pump_until);
 use Test::More;
+use Time::HiRes qw(usleep);
 
 =pod
 
@@ -371,4 +372,79 @@ sub set_query_timer_restart
 	return $self->{query_timer_restart};
 }
 
+=pod
+
+=item $session->poll_query_until($query [, $expected ])
+
+Run B<$query> repeatedly in this background session, until it returns the 
+B<$expected> result ('t', or SQL boolean true, by default).
+Continues polling if the query returns an error result.
+Times out after a reasonable number of attempts.
+Returns 1 if successful, 0 if timed out.
+
+=cut
+
+sub poll_query_until
+{
+	my ($self, $query, $expected) = @_;
+
+	$expected = 't' unless defined($expected);    # default value
+
+	my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
+	my $attempts = 0;
+	my ($stdout, $stderr_flag);
+
+	while ($attempts < $max_attempts)
+	{
+		($stdout, $stderr_flag) = $self->query($query);
+
+		chomp($stdout);
+
+		# If query succeeded and returned expected result
+		if (!$stderr_flag && $stdout eq $expected)
+		{
+			return 1;
+		}
+
+		# Wait 0.1 second before retrying.
+		usleep(100_000);
+
+		$attempts++;
+	}
+
+	# Give up. Print the output from the last attempt, hopefully that's useful
+	# for debugging.
+	my $stderr_output = $stderr_flag ? $self->{stderr} : '';
+	diag qq(poll_query_until timed out executing this query:
+$query
+expecting this output:
+$expected
+last actual query output:
+$stdout
+with stderr:
+$stderr_output);
+	return 0;
+}
+
+=item $session->wait_for_event(backend_type, wait_event_name)
+
+Poll pg_stat_activity until backend_type reaches wait_event_name using this
+background session.
+
+=cut
+
+sub wait_for_event
+{
+	my ($self, $backend_type, $wait_event_name) = @_;
+
+	$self->poll_query_until(qq[
+		SELECT count(*) > 0 FROM pg_stat_activity
+		WHERE backend_type = '$backend_type' AND wait_event = '$wait_event_name'
+	])
+	  or die
+	  qq(timed out when waiting for $backend_type to reach wait event '$wait_event_name');
+
+	return;
+}
+
 1;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 432509277c9..f7ce00990cc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2774,6 +2774,7 @@ ShellTypeInfo
 ShippableCacheEntry
 ShippableCacheKey
+ShmemControl
 ShmemIndexEnt
 ShutdownForeignScan_function
 ShutdownInformation
 ShutdownMode
-- 
2.34.1

0002-Memory-and-address-space-management-for-buf-20251114.patchtext/x-patch; charset=US-ASCII; name=0002-Memory-and-address-space-management-for-buf-20251114.patchDownload
From a7f25c62ef900b2b115c575c2d8aa158ec825c69 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 28 Feb 2025 19:54:47 +0100
Subject: [PATCH 2/4] Memory and address space management for buffer resizing

This has three changes

1. Allow to use multiple shared memory mappings
============================================

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits how the shared memory can be organized.

Introduce the possibility of allocating multiple shared memory mappings,
where each mapping is associated with a specified shared memory segment.
A new shared memory API is introduced, extended with the segment as an
additional parameter. As the path of least resistance, the original API
is kept in place, utilizing the main shared memory segment.
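
For illustration, a hedged sketch of how an allocation in a specific
segment might look with the extended API (the struct name here is made
up; the function signature and the segment constant are taken from the
patch below):

    bool        found;
    MyStruct   *ptr;        /* MyStruct is a hypothetical example type */

    /* Same as ShmemInitStruct(), but placed in a particular segment. */
    ptr = ShmemInitStructInSegment("My Struct", sizeof(MyStruct), &found,
                                   CHECKPOINT_BUFFERS_SHMEM_SEGMENT);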

Modifies pg_shmem_allocations to also report the shared memory segment.
Adds pg_shmem_segments to report information about shared memory segments.

2. Address space reservation for shared memory
============================================

Currently the shared memory layout is designed to pack everything tightly
together, leaving no space between mappings for resizing. Here is how it
looks for one mapping in /proc/$PID/maps; /dev/zero represents the
anonymous shared memory in question:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
    ...

Make the layout more dynamic by splitting every shared memory segment
into two parts:

* An anonymous file, which actually contains the shared memory content.
  Such an anonymous file is created via memfd_create; it lives in
  memory, behaves like a regular file, and is semantically equivalent to
  anonymous memory allocated via mmap with MAP_ANONYMOUS.

* A reservation mapping, whose size is much larger than the required
  shared segment size. This mapping is created with the MAP_NORESERVE
  flag (so that the reserved space is not counted against memory
  limits). The anonymous file is mapped into this reservation mapping,
  as illustrated by the sketch below.
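
As a rough standalone illustration of this scheme (not the patch's
actual code; names, sizes and error handling are made up), creating such
a segment on Linux could look like this:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t  reserved = 512UL * 1024 * 1024; /* address space upper bound */
        size_t  active = 128UL * 1024 * 1024;   /* currently backed part */
        int     fd;
        char   *base;

        /* Anonymous in-memory file holding the actual segment contents. */
        fd = memfd_create("sketch:main", 0);
        if (fd < 0 || ftruncate(fd, (off_t) active) < 0)
            return 1;

        /*
         * Map the whole reserved range of the file.  MAP_NORESERVE keeps
         * the unused tail from being counted against memory accounting.
         */
        base = mmap(NULL, reserved, PROT_NONE,
                    MAP_SHARED | MAP_NORESERVE, fd, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Only the part backed by the file is made accessible. */
        if (mprotect(base, active, PROT_READ | PROT_WRITE) < 0)
            return 1;

        memset(base, 0, active);
        printf("mapped at %p: %zu bytes usable, %zu reserved\n",
               (void *) base, active, reserved);

        munmap(base, reserved);
        close(fd);
        return 0;
    }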

If the address maps had to be changed while resizing the shared buffer
pool, that would need to be done in the Postmaster too, so that new
backends inherit the resized address space from the Postmaster.
However, the Postmaster is not involved in the ProcSignalBarrier
mechanism, and we don't want it to spend time on things other than its
core functionality. To achieve that, the maximum required address space
maps are set up upfront with read and write access when starting the
server. When resizing the buffer pool, only the backing file object is
resized from the coordinator. This also keeps the ProcSignalBarrier
handling code lightweight for backends other than the coordinator.

The resulting layout looks like this:

    00400000-00490000         /path/bin/postgres
    ...
    3f526000-3f590000 rw-p 		[heap]
    7fbd827fe000-7fbd8bdde000 rw-s 	/memfd:main (deleted) -- anon file
    7fbd8bdde000-7fbe82800000 ---s 	/memfd:main (deleted) -- reservation
    7fbe82800000-7fbe90670000 r--p 	/usr/lib/locale/locale-archive
    7fbe90800000-7fbe90941000 r-xp 	/usr/lib64/libstdc++.so.6.0.34

To resize a shared memory segment in this layout, it's possible to use
ftruncate on the memory-mapped file.
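
Continuing the hypothetical sketch above (again, not the patch's actual
code), growing the segment would then amount to resizing the backing
file and opening up protections within the already reserved range:

    size_t  new_size = 256UL * 1024 * 1024;

    /* Grow the backing file; the reserved address range stays put. */
    if (ftruncate(fd, (off_t) new_size) == 0 &&
        mprotect(base, new_size, PROT_READ | PROT_WRITE) == 0)
        active = new_size;

    /*
     * Shrinking would go the other way around: mprotect() the tail back
     * to PROT_NONE, then ftruncate() the file to the smaller size.
     */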

This approach also does not impact the actual memory usage as reported
by the kernel.

TODO: Verify that cgroup v2 doesn't have any problems with that as well.
To verify, a new cgroup was created with a memory limit of 256 MB, and
then PostgreSQL was launched within this cgroup with shared_buffers =
128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    # from that shell
    $ cat memory.current
    17465344 (~16.6 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    20770816 (~19.8 MB)

There are also a few unrelated advantages of using memory-mapped files:

* We've got a file descriptor, which could be used for regular file
  operations (modification, truncation, you name it).

* The file could be given a name, which improves readability when it
  comes to process maps.

* By default, Linux will not include file-backed shared mappings in a core
  dump, making it more convenient to work with them in PostgreSQL: no more
  huge dumps to process. (Some hackers have expressed concerns about this
  behavior.)

The downside is that memfd_create is Linux specific.

3. Refactor CalculateShmemSize()
================================

This function calls many functions which return the amount of shared
memory required for different shared memory data structures. Up until
now, the returned total of these sizes was used to create a single
shared memory segment. With this change, CalculateShmemSize() needs to
estimate memory requirements for each of the segments. It now takes an
array of MemoryMappingSizes, containing as many elements as the number
of segments, as an argument. The sizes returned by all the functions it
calls, except BufferManagerShmemSize(), are added up and saved in the
first element (index 0) of the array. BufferManagerShmemSize() is modified
to save the amount of memory required for the buffer-manager-related
segments in the corresponding array elements. Additionally, it saves the
amount of reserved address space. For now, the amount of reserved address
space is the same as the amount of required memory, but that is expected
to change with the next commit, which implements buffer pool resizing.
CalculateShmemSize() now returns the total of the sizes of all the
segments.

Author: Dmitrii Dolgov and Ashutosh Bapat
Reviewed-by: Tomas Vondra
---
 doc/src/sgml/system-views.sgml         |   9 +
 src/backend/catalog/system_views.sql   |   7 +
 src/backend/port/sysv_shmem.c          | 425 +++++++++++++++++++------
 src/backend/port/win32_sema.c          |   2 +-
 src/backend/port/win32_shmem.c         |  14 +-
 src/backend/storage/buffer/buf_init.c  |  56 ++--
 src/backend/storage/buffer/buf_table.c |   6 +-
 src/backend/storage/buffer/freelist.c  |   5 +-
 src/backend/storage/ipc/ipc.c          |   4 +-
 src/backend/storage/ipc/ipci.c         |  99 ++++--
 src/backend/storage/ipc/shmem.c        | 243 ++++++++++----
 src/backend/storage/lmgr/lwlock.c      |  15 +-
 src/include/catalog/pg_proc.dat        |  12 +-
 src/include/portability/mem.h          |   2 +-
 src/include/storage/bufmgr.h           |   3 +-
 src/include/storage/ipc.h              |   4 +-
 src/include/storage/pg_shmem.h         |  60 +++-
 src/include/storage/shmem.h            |  12 +
 src/test/regress/expected/rules.out    |  10 +-
 19 files changed, 755 insertions(+), 233 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 8f3e2741051..bc70a3ee6c9 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -4233,6 +4233,15 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>segment</structfield> <type>text</type>
+      </para>
+      <para>
+       The name of the shared memory segment containing the allocation.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>off</structfield> <type>int8</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 059e8778ca7..59145066647 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -668,6 +668,13 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
 REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
 GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
 
+CREATE VIEW pg_shmem_segments AS
+    SELECT * FROM pg_get_shmem_segments();
+
+REVOKE ALL ON pg_shmem_segments FROM PUBLIC;
+GRANT SELECT ON pg_shmem_segments TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_segments() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_segments() TO pg_read_all_stats;
 CREATE VIEW pg_shmem_allocations_numa AS
     SELECT * FROM pg_get_shmem_allocations_numa();
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..cc4b2c80e1a 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -90,12 +90,49 @@ typedef enum
 	SHMSTATE_UNATTACHED,		/* pertinent to DataDir, no attached PIDs */
 } IpcMemoryState;
 
-
+/*
+ * TODO: These should be moved into ShmemSegment, now that there can be multiple
+ * shared memory segments. But there's windows specific code which will need
+ * adjustment, so leaving it here.
+ */
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+/*
+ * Anonymous mapping layout we use looks like this:
+ *
+ * 00400000-00c2a000 r-xp 			/bin/postgres
+ * ...
+ * 3f526000-3f590000 rw-p 			[heap]
+ * 7fbd827fe000-7fbd8bdde000 rw-s 	/memfd:main (deleted)
+ * 7fbd8bdde000-7fbe82800000 ---s 	/memfd:main (deleted)
+ * 7fbe82800000-7fbe90670000 r--p 	/usr/lib/locale/locale-archive
+ * 7fbe90800000-7fbe90941000 r-xp 	/usr/lib64/libstdc++.so.6.0.34
+ * ...
+ *
+ * We need to place shared memory mappings in such a way, that there will be
+ * gaps between them in the address space. Those gaps have to be large enough
+ * to resize the mapping up to certain size, without counting towards the total
+ * memory consumption.
+ *
+ * To achieve this, for each shared memory segment we first create an anonymous
+ * file of the specified size using memfd_create, which will accommodate the
+ * actual shared memory mapping content. It is represented by the first
+ * /memfd:main region with rw permissions. Then we create a mapping for this
+ * file using mmap, with a size much larger than required and the flag
+ * MAP_NORESERVE (prevents the reserved space from being counted against
+ * memory limits). The mapping serves as an address space reservation, into
+ * which the shared memory segment can be extended, and is represented by the
+ * second /memfd:main region.
+ */
+
+/*
+ * Flag telling that we have decided to use huge pages.
+ *
+ * XXX: It's possible to use GetConfigOption("huge_pages_status", false, false)
+ * instead, but it feels like overkill.
+ */
+static bool huge_pages_on = false;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +141,27 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
 										   void *attachAt,
 										   PGShmemHeader **addr);
 
+const char*
+MappingName(int shmem_segment)
+{
+	switch (shmem_segment)
+	{
+		case MAIN_SHMEM_SEGMENT:
+			return "main";
+		case BUFFERS_SHMEM_SEGMENT:
+			return "buffers";
+		case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
+			return "descriptors";
+		case BUFFER_IOCV_SHMEM_SEGMENT:
+			return "iocv";
+		case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
+			return "checkpoint";
+		case STRATEGY_SHMEM_SEGMENT:
+			return "strategy";
+		default:
+			return "unknown";
+	}
+}
 
 /*
  *	InternalIpcMemoryCreate(memKey, size)
@@ -470,19 +528,20 @@ PGSharedMemoryAttach(IpcMemoryId shmId,
  * hugepage sizes, we might want to think about more invasive strategies,
  * such as increasing shared_buffers to absorb the extra space.
  *
- * Returns the (real, assumed or config provided) page size into
- * *hugepagesize, and the hugepage-related mmap flags to use into
- * *mmap_flags if requested by the caller.  If huge pages are not supported,
- * *hugepagesize and *mmap_flags are set to 0.
+ * Returns the (real, assumed or config provided) page size into *hugepagesize,
+ * the hugepage-related mmap and memfd flags to use into *mmap_flags and
+ * *memfd_flags if requested by the caller. If huge pages are not supported,
+ * *hugepagesize, *mmap_flags and *memfd_flags are set to 0.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 #ifdef MAP_HUGETLB
 
 	Size		default_hugepagesize = 0;
 	Size		hugepagesize_local = 0;
 	int			mmap_flags_local = 0;
+	int			memfd_flags_local = 0;
 
 	/*
 	 * System-dependent code to find out the default huge page size.
@@ -541,6 +600,7 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	}
 
 	mmap_flags_local = MAP_HUGETLB;
+	memfd_flags_local = MFD_HUGETLB;
 
 	/*
 	 * On recent enough Linux, also include the explicit page size, if
@@ -551,7 +611,16 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	{
 		int			shift = pg_ceil_log2_64(hugepagesize_local);
 
-		mmap_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+		memfd_flags_local |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+	}
+#endif
+
+#if defined(MFD_HUGE_MASK) && defined(MFD_HUGE_SHIFT)
+	if (hugepagesize_local != default_hugepagesize)
+	{
+		int			shift = pg_ceil_log2_64(hugepagesize_local);
+
+		memfd_flags_local |= (shift & MFD_HUGE_MASK) << MFD_HUGE_SHIFT;
 	}
 #endif
 
@@ -560,6 +629,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*mmap_flags = mmap_flags_local;
 	if (hugepagesize)
 		*hugepagesize = hugepagesize_local;
+	if (memfd_flags)
+		*memfd_flags = memfd_flags_local;
 
 #else
 
@@ -567,6 +638,8 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		*hugepagesize = 0;
 	if (mmap_flags)
 		*mmap_flags = 0;
+	if (memfd_flags)
+		*memfd_flags = 0;
 
 #endif							/* MAP_HUGETLB */
 }
@@ -588,83 +661,242 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * Wrapper around posix_fallocate() to allocate memory for a given shared memory
+ * segment.
+ *
+ * Performs retry on EINTR, and raises error upon failure.
+ */
+static void
+shmem_fallocate(int fd, const char *mapping_name, Size size, int elevel)
+{
+#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
+	int ret;
+
+
+	/*
+	 * If there is not enough memory, trying to access a hole in address space
+	 * will cause SIGBUS. If supported, avoid that by allocating memory upfront.
+	 * 
+	 * We still use a traditional EINTR retry loop to handle SIGCONT.
+	 * posix_fallocate() doesn't restart automatically, and we don't want this to
+	 * fail if you attach a debugger.
+	 */
+	do
+	{
+		ret = posix_fallocate(fd, 0, size);
+	} while (ret == EINTR);
+
+	if (ret != 0)
+	{
+		ereport(elevel,
+				(errmsg("segment[%s]: could not allocate space for anonymous file: %s",
+						mapping_name, strerror(ret)),
+				 (ret == ENOMEM) ?
+				 errhint("This error usually means that PostgreSQL's request "
+						 "for a shared memory segment exceeded available memory, "
+						 "swap space, or huge pages. To reduce the request size "
+						 "(currently %zu bytes), reduce PostgreSQL's shared "
+						 "memory usage, perhaps by reducing \"shared_buffers\" or "
+						 "\"max_connections\".",
+						 size) : 0));
+	}
+#endif /* HAVE_POSIX_FALLOCATE && __linux__ */
+}
+
+/*
+ * Round up the required amount of memory and the amount of required reserved
+ * address space to the nearest huge page size.
+ */
+static inline void
+round_off_mapping_sizes_for_hugepages(MemoryMappingSizes *mapping, int hugepagesize)
+{
+	if (hugepagesize == 0)
+		return;
+
+	if (mapping->shmem_req_size % hugepagesize != 0)
+		mapping->shmem_req_size += hugepagesize -
+			(mapping->shmem_req_size % hugepagesize);
+
+	if (mapping->shmem_reserved % hugepagesize != 0)
+		mapping->shmem_reserved = mapping->shmem_reserved + hugepagesize -
+			(mapping->shmem_reserved % hugepagesize);
+}
+
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested. If needed,
+ * it also rounds up the mapping reserved size to be a multiple of huge page
+ * size.
+ *
+ * Note that we do not fallback from huge pages to regular pages in this
+ * function, this decision was already made in ReserveAnonymousMemory and we
+ * stick to it.
+ * 
+ * TODO: Update the prologue to be consistent with the code.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(MemoryMappingSizes *mapping, int segment_id)
 {
-	Size		allocsize = *size;
 	void	   *ptr = MAP_FAILED;
-	int			mmap_errno = 0;
+	int			save_errno = 0;
+	int			mmap_flags = PG_MMAP_FLAGS, memfd_flags = 0;
+	ShmemSegment *segment = &Segments[segment_id];
 
 #ifndef MAP_HUGETLB
-	/* PGSharedMemoryCreate should have dealt with this case */
-	Assert(huge_pages != HUGE_PAGES_ON);
+	/* PrepareHugePages should have dealt with this case */
+	Assert(huge_pages != HUGE_PAGES_ON && !huge_pages_on);
 #else
-	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	if (huge_pages_on)
 	{
-		/*
-		 * Round up the request size to a suitable large value.
-		 */
 		Size		hugepagesize;
-		int			mmap_flags;
 
-		GetHugePageSize(&hugepagesize, &mmap_flags);
+		/* Make sure nothing is messed up */
+		Assert(huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY);
 
-		if (allocsize % hugepagesize != 0)
-			allocsize += hugepagesize - (allocsize % hugepagesize);
+		/* Round up the request size to a suitable large value */
+		GetHugePageSize(&hugepagesize, &mmap_flags, &memfd_flags);
+		round_off_mapping_sizes_for_hugepages(mapping, hugepagesize);
 
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
-		mmap_errno = errno;
-		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
-				 allocsize);
+		/* Verify that the new size is within the reserved boundaries */
+		Assert(mapping->shmem_reserved >= mapping->shmem_req_size);
+
+		mmap_flags = PG_MMAP_FLAGS | mmap_flags;
 	}
 #endif
 
 	/*
-	 * Report whether huge pages are in use.  This needs to be tracked before
-	 * the second mmap() call if attempting to use huge pages failed
-	 * previously.
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped, it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, thus it has
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
 	 */
-	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
-					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	segment->segment_fd = memfd_create(MappingName(segment_id), memfd_flags);
+	if (segment->segment_fd == -1)
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not create anonymous shared memory file: %m",
+						MappingName(segment_id))));
 
-	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
-	{
-		/*
-		 * Use the original size, not the rounded-up value, when falling back
-		 * to non-huge pages.
-		 */
-		allocsize = *size;
-		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
-		mmap_errno = errno;
-	}
+	elog(DEBUG1, "segment[%s]: mmap(%zu)", MappingName(segment_id), mapping->shmem_req_size);
 
+	/*
+	 * Reserve maximum required address space for future expansion of this
+	 * memory segment. MAP_NORESERVE ensures that no memory is allocated. The
+	 * whole address space will be setup for read/write access, so that memory
+	 * allocated to this address space can be read or written to even if it is
+	 * resized.
+	 */
+	ptr = mmap(NULL, mapping->shmem_reserved, PROT_READ | PROT_WRITE,
+			   mmap_flags | MAP_NORESERVE, segment->segment_fd, 0);
 	if (ptr == MAP_FAILED)
+		ereport(FATAL,
+				(errmsg("segment[%s]: could not map anonymous shared memory: %m",
+						MappingName(segment_id))));
+
+	/*
+	 * Resize the backing file to the required size. On platforms where it is
+	 * supported, we also allocate the required memory upfront. On other
+	 * platforms the memory up to the size of the file is allocated on demand.
+	 */
+	if(ftruncate(segment->segment_fd, mapping->shmem_req_size) == -1)
 	{
-		errno = mmap_errno;
+		save_errno = errno;
+
+		close(segment->segment_fd);
+
+		errno = save_errno;
 		ereport(FATAL,
-				(errmsg("could not map anonymous shared memory: %m"),
-				 (mmap_errno == ENOMEM) ?
+				(errmsg("segment[%s]: could not truncate anonymous file to size %zu: %m",
+						MappingName(segment_id), mapping->shmem_req_size),
+				 (save_errno == ENOMEM) ?
 				 errhint("This error usually means that PostgreSQL's request "
 						 "for a shared memory segment exceeded available memory, "
 						 "swap space, or huge pages. To reduce the request size "
 						 "(currently %zu bytes), reduce PostgreSQL's shared "
 						 "memory usage, perhaps by reducing \"shared_buffers\" or "
 						 "\"max_connections\".",
-						 allocsize) : 0));
+						 mapping->shmem_req_size) : 0));
 	}
+	shmem_fallocate(segment->segment_fd, MappingName(segment_id), mapping->shmem_req_size, FATAL);
 
-	*size = allocsize;
-	return ptr;
+	segment->shmem = ptr;
+	segment->shmem_size = mapping->shmem_req_size;
+	segment->shmem_reserved = mapping->shmem_reserved;
+}
+
+/*
+ * PrepareHugePages
+ *
+ * Figure out if there are enough huge pages to allocate all shared memory
+ * segments, and report that information via huge_pages_status and
+ * huge_pages_on. It needs to be called before creating shared memory segments.
+ *
+ * It is necessary to maintain the same semantic (simple on/off) for
+ * huge_pages_status, even if there are multiple shared memory segments: all
+ * segments either use huge pages or not, there is no mix of segments with
+ * different page sizes. The latter might actually be beneficial, in particular
+ * because only some segments may require a large amount of memory, but for now
+ * we go with the simple solution.
+ */
+void
+PrepareHugePages(void)
+{
+	void	   *ptr = MAP_FAILED;
+	MemoryMappingSizes mapping_sizes[NUM_MEMORY_MAPPINGS];
+
+	CalculateShmemSize(mapping_sizes);
+
+	/* Complain if hugepages demanded but we can't possibly support them */
+#if !defined(MAP_HUGETLB)
+	if (huge_pages == HUGE_PAGES_ON)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("huge pages not supported on this platform")));
+#else
+	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
+	{
+		Size		hugepagesize, total_size = 0;
+		int			mmap_flags;
+
+		GetHugePageSize(&hugepagesize, &mmap_flags, NULL);
+
+		/*
+		 * Figure out how much memory is needed for all segments, keeping in
+		 * mind that for every segment this value will be rounded up to the
+		 * huge page size. The resulting value will be used to probe memory and
+		 * decide whether we will allocate huge pages or not.
+		 */
+		for(int segment = 0; segment < NUM_MEMORY_MAPPINGS; segment++)
+		{
+			Size segment_size = mapping_sizes[segment].shmem_req_size;
+
+			if (segment_size % hugepagesize != 0)
+				segment_size += hugepagesize - (segment_size % hugepagesize);
+
+			total_size += segment_size;
+		}
+
+		/* Map total amount of memory to test its availability. */
+		elog(DEBUG1, "reserving space: probe mmap(%zu) with MAP_HUGETLB",
+					 total_size);
+		ptr = mmap(NULL, total_size, PROT_NONE,
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
+	}
+#endif
+
+	/*
+	 * Report whether huge pages are in use. This needs to be tracked before
+	 * creating shared memory segments.
+	 */
+	SetConfigOption("huge_pages_status", (ptr == MAP_FAILED) ? "off" : "on",
+					PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
+	huge_pages_on = ptr != MAP_FAILED;
 }
 
 /*
@@ -674,20 +906,25 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-	/* Release anonymous shared memory block, if any. */
-	if (AnonymousShmem != NULL)
+	for(int i = 0; i < NUM_MEMORY_MAPPINGS; i++)
 	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		ShmemSegment *segment = &Segments[i];
+
+		/* Release anonymous shared memory block, if any. */
+		if (segment->shmem != NULL)
+		{
+			if (munmap(segment->shmem, segment->shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 segment->shmem, segment->shmem_size);
+			segment->shmem = NULL;
+		}
 	}
 }
 
 /*
  * PGSharedMemoryCreate
  *
- * Create a shared memory segment of the given size and initialize its
+ * Create a shared memory segment for the given mapping and initialize its
  * standard header.  Also, register an on_shmem_exit callback to release
  * the storage.
  *
@@ -697,7 +934,7 @@ AnonymousShmemDetach(int status, Datum arg)
  * postmaster or backend.
  */
 PGShmemHeader *
-PGSharedMemoryCreate(Size size,
+PGSharedMemoryCreate(MemoryMappingSizes *mapping, int segment_id,
 					 PGShmemHeader **shim)
 {
 	IpcMemoryKey NextShmemSegID;
@@ -705,6 +942,7 @@ PGSharedMemoryCreate(Size size,
 	PGShmemHeader *hdr;
 	struct stat statbuf;
 	Size		sysvsize;
+	ShmemSegment *segment = &Segments[segment_id];
 
 	/*
 	 * We use the data directory's ID info (inode and device numbers) to
@@ -717,14 +955,6 @@ PGSharedMemoryCreate(Size size,
 				 errmsg("could not stat data directory \"%s\": %m",
 						DataDir)));
 
-	/* Complain if hugepages demanded but we can't possibly support them */
-#if !defined(MAP_HUGETLB)
-	if (huge_pages == HUGE_PAGES_ON)
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("huge pages not supported on this platform")));
-#endif
-
 	/* For now, we don't support huge pages in SysV memory */
 	if (huge_pages == HUGE_PAGES_ON && shared_memory_type != SHMEM_TYPE_MMAP)
 		ereport(ERROR,
@@ -732,12 +962,12 @@ PGSharedMemoryCreate(Size size,
 				 errmsg("huge pages not supported with the current \"shared_memory_type\" setting")));
 
 	/* Room for a header? */
-	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+	Assert(mapping->shmem_req_size > MAXALIGN(sizeof(PGShmemHeader)));
 
 	if (shared_memory_type == SHMEM_TYPE_MMAP)
 	{
-		AnonymousShmem = CreateAnonymousSegment(&size);
-		AnonymousShmemSize = size;
+		/* On success, mapping data will be modified. */
+		CreateAnonymousSegment(mapping, segment_id);
 
 		/* Register on-exit routine to unmap the anonymous segment */
 		on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -747,7 +977,7 @@ PGSharedMemoryCreate(Size size,
 	}
 	else
 	{
-		sysvsize = size;
+		sysvsize = mapping->shmem_req_size;
 
 		/* huge pages are only available with mmap */
 		SetConfigOption("huge_pages_status", "off",
@@ -760,7 +990,7 @@ PGSharedMemoryCreate(Size size,
 	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
 	 * that, but prefer fixing it over coping here.)
 	 */
-	NextShmemSegID = statbuf.st_ino;
+	NextShmemSegID = statbuf.st_ino + segment_id;
 
 	for (;;)
 	{
@@ -852,13 +1082,13 @@ PGSharedMemoryCreate(Size size,
 	/*
 	 * Initialize space allocation status for segment.
 	 */
-	hdr->totalsize = size;
+	hdr->totalsize = mapping->shmem_req_size;
 	hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
 	*shim = hdr;
 
 	/* Save info for possible future use */
-	UsedShmemSegAddr = memAddress;
-	UsedShmemSegID = (unsigned long) NextShmemSegID;
+	segment->seg_addr = memAddress;
+	segment->seg_id = (unsigned long) NextShmemSegID;
 
 	/*
 	 * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +1096,10 @@ PGSharedMemoryCreate(Size size,
 	 * block. Otherwise, the System V shared memory block is only a shim, and
 	 * we must return a pointer to the real block.
 	 */
-	if (AnonymousShmem == NULL)
+	if (segment->shmem == NULL)
 		return hdr;
-	memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-	return (PGShmemHeader *) AnonymousShmem;
+	memcpy(segment->shmem, hdr, sizeof(PGShmemHeader));
+	return (PGShmemHeader *) segment->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1199,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-	if (UsedShmemSegAddr != NULL)
+	for(int i = 0; i < NUM_MEMORY_MAPPINGS; i++)
 	{
-		if ((shmdt(UsedShmemSegAddr) < 0)
+		ShmemSegment *segment = &Segments[i];
+
+		if (segment->seg_addr != NULL)
+		{
+			if ((shmdt(segment->seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-		/* Work-around for cygipc exec bug */
-			&& shmdt(NULL) < 0
+			/* Work-around for cygipc exec bug */
+				&& shmdt(NULL) < 0
 #endif
-			)
-			elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-		UsedShmemSegAddr = NULL;
-	}
+				)
+				elog(LOG, "shmdt(%p) failed: %m", segment->seg_addr);
+			segment->seg_addr = NULL;
+		}
 
-	if (AnonymousShmem != NULL)
-	{
-		if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-			elog(LOG, "munmap(%p, %zu) failed: %m",
-				 AnonymousShmem, AnonymousShmemSize);
-		AnonymousShmem = NULL;
+		if (segment->shmem != NULL)
+		{
+			if (munmap(segment->shmem, segment->shmem_size) < 0)
+				elog(LOG, "munmap(%p, %zu) failed: %m",
+					 segment->shmem, segment->shmem_size);
+			segment->shmem = NULL;
+		}
 	}
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index 5854ad1f54d..e7365ff8060 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
 	mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
 	if (mySemSet == NULL)
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 4dee856d6bd..5c0c32babaf 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -204,7 +204,7 @@ EnableLockPagesPrivilege(int elevel)
  * standard header.
  */
 PGShmemHeader *
-PGSharedMemoryCreate(Size size,
+PGSharedMemoryCreate(MemoryMappingSizes *mapping_sizes, int segment_id,
 					 PGShmemHeader **shim)
 {
 	void	   *memAddress;
@@ -216,9 +216,10 @@ PGSharedMemoryCreate(Size size,
 	DWORD		size_high;
 	DWORD		size_low;
 	SIZE_T		largePageSize = 0;
-	Size		orig_size = size;
+	Size		size = mapping_sizes->shmem_req_size;
 	DWORD		flProtect = PAGE_READWRITE;
 	DWORD		desiredAccess;
+	ShmemSegment *segment = &Segments[segment_id];
 
 	ShmemProtectiveRegion = VirtualAlloc(NULL, PROTECTIVE_REGION_SIZE,
 										 MEM_RESERVE, PAGE_NOACCESS);
@@ -304,7 +305,7 @@ retry:
 				 * Use the original size, not the rounded-up value, when
 				 * falling back to non-huge pages.
 				 */
-				size = orig_size;
+				size = mapping_sizes->shmem_req_size;
 				flProtect = PAGE_READWRITE;
 				goto retry;
 			}
@@ -393,6 +394,11 @@ retry:
 	hdr->dsm_control = 0;
 
 	/* Save info for possible future use */
+	segment->shmem_size = size;
+	segment->seg_addr = memAddress;
+	segment->shmem = (Pointer) hdr;
+	segment->seg_id = (unsigned long) hmap2;
+
 	UsedShmemSegAddr = memAddress;
 	UsedShmemSegSize = size;
 	UsedShmemSegID = hmap2;
@@ -627,7 +633,7 @@ pgwin32_ReserveSharedMemoryRegion(HANDLE hChild)
  * use GetLargePageMinimum() instead.
  */
 void
-GetHugePageSize(Size *hugepagesize, int *mmap_flags)
+GetHugePageSize(Size *hugepagesize, int *mmap_flags, int *memfd_flags)
 {
 	if (hugepagesize)
 		*hugepagesize = 0;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6fd3a6bbac5..4fa547f48de 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -17,6 +17,7 @@
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -62,7 +63,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). Size of data structures initialized
+ * here depends on NBuffers, and to be able to change NBuffers without a
+ * restart we store each structure into a separate shared memory segment, which
+ * could be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -74,22 +78,22 @@ BufferManagerShmemInit(void)
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
+		ShmemInitStructInSegment("Buffer Descriptors",
 						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+						&foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
 		TYPEALIGN(PG_IO_ALIGN_SIZE,
-				  ShmemInitStruct("Buffer Blocks",
+				  ShmemInitStructInSegment("Buffer Blocks",
 								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-								  &foundBufs));
+								  &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
 	/* Align condition variables to cacheline boundary. */
 	BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Condition Variables",
+		ShmemInitStructInSegment("Buffer IO Condition Variables",
 						NBuffers * sizeof(ConditionVariableMinimallyPadded),
-						&foundIOCV);
+						&foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
@@ -99,8 +103,9 @@ BufferManagerShmemInit(void)
 	 * painful.
 	 */
 	CkptBufferIds = (CkptSortItem *)
-		ShmemInitStruct("Checkpoint BufferIds",
-						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+		ShmemInitStructInSegment("Checkpoint BufferIds",
+						NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+						CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
 	if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
 	{
@@ -147,33 +152,42 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc., split across the
+ * shared memory segments it uses. The main segment must not contain
+ * anything related to buffers; every other segment receives its part
+ * of the data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(MemoryMappingSizes *mapping_sizes)
 {
-	Size		size = 0;
+	Size		size;
 
-	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
+	/* size of buffer descriptors, plus alignment padding */
+	size = add_size(0, mul_size(NBuffers, sizeof(BufferDescPadded)));
 	size = add_size(size, PG_CACHE_LINE_SIZE);
+	mapping_sizes[BUFFER_DESCRIPTORS_SHMEM_SEGMENT].shmem_req_size = size;
+	mapping_sizes[BUFFER_DESCRIPTORS_SHMEM_SEGMENT].shmem_reserved = size;
 
 	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
+	size = add_size(0, PG_IO_ALIGN_SIZE);
 	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	mapping_sizes[BUFFERS_SHMEM_SEGMENT].shmem_req_size = size;
+	mapping_sizes[BUFFERS_SHMEM_SEGMENT].shmem_reserved = size;
 
 	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	mapping_sizes[STRATEGY_SHMEM_SEGMENT].shmem_req_size = StrategyShmemSize();
+	mapping_sizes[STRATEGY_SHMEM_SEGMENT].shmem_reserved = StrategyShmemSize();
 
-	/* size of I/O condition variables */
-	size = add_size(size, mul_size(NBuffers,
+	/* size of I/O condition variables, plus alignment padding */
+	size = add_size(0, mul_size(NBuffers,
 								   sizeof(ConditionVariableMinimallyPadded)));
-	/* to allow aligning the above */
 	size = add_size(size, PG_CACHE_LINE_SIZE);
+	mapping_sizes[BUFFER_IOCV_SHMEM_SEGMENT].shmem_req_size = size;
+	mapping_sizes[BUFFER_IOCV_SHMEM_SEGMENT].shmem_reserved = size;
 
 	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	mapping_sizes[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_req_size = mul_size(NBuffers, sizeof(CkptSortItem));
+	mapping_sizes[CHECKPOINT_BUFFERS_SHMEM_SEGMENT].shmem_reserved = mul_size(NBuffers, sizeof(CkptSortItem));
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index f0c39ec2822..67e87f9935d 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -25,6 +25,7 @@
 #include "funcapi.h"
 #include "storage/buf_internals.h"
 #include "storage/lwlock.h"
+#include "storage/pg_shmem.h"
 #include "utils/rel.h"
 #include "utils/builtins.h"
 
@@ -64,10 +65,11 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+	SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION | HASH_FIXED_SIZE);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION | HASH_FIXED_SIZE,
+								  STRATEGY_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 28d952b3534..13ee840ab9f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 
 #define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
@@ -418,9 +419,9 @@ StrategyInitialize(bool init)
 	 * Get or create the shared strategy control block
 	 */
 	StrategyControl = (BufferStrategyControl *)
-		ShmemInitStruct("Buffer Strategy Status",
+		ShmemInitStructInSegment("Buffer Strategy Status",
 						sizeof(BufferStrategyControl),
-						&found);
+						&found, STRATEGY_SHMEM_SEGMENT);
 
 	if (!found)
 	{
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 2704e80b3a7..1965b2d3eb4 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -61,6 +61,8 @@ static void proc_exit_prepare(int code);
  * but provide some additional features we need --- in particular,
  * we want to register callbacks to invoke when we are disconnecting
  * from a broken shared-memory context but not exiting the postmaster.
+ * The maximum number of such exit callbacks depends on the number of
+ * shared memory segments.
  *
  * Callback functions can take zero, one, or two args: the first passed
  * arg is the integer exitcode, the second is the Datum supplied when
@@ -68,7 +70,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index b23d0c19360..41190f96639 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -81,10 +81,17 @@ RequestAddinShmemSpace(Size size)
 
 /*
  * CalculateShmemSize
- *		Calculates the amount of shared memory needed.
+ * 		Calculates the amount of shared memory needed.
+ *
+ * The amount of shared memory required per segment is saved in mapping_sizes,
+ * which is expected to be an array of size NUM_MEMORY_MAPPINGS. The total
+ * amount of memory needed across all the segments is returned. For the memory
+ * mappings which reserve address space for future expansion, the required
+ * amount of reserved space is saved in their mapping_sizes entries; the
+ * reserved space is not included in the returned value.
  */
 Size
-CalculateShmemSize(void)
+CalculateShmemSize(MemoryMappingSizes *mapping_sizes)
 {
 	Size		size;
 
@@ -102,7 +109,13 @@ CalculateShmemSize(void)
 											 sizeof(ShmemIndexEnt)));
 	size = add_size(size, dsm_estimate_size());
 	size = add_size(size, DSMRegistryShmemSize());
-	size = add_size(size, BufferManagerShmemSize());
+
+	/*
+	 * The buffer manager records its memory requirements for every shared
+	 * memory segment it uses in the corresponding mapping_sizes entries.
+	 * Only the size required from the main shared memory segment is added here.
+	 */
+	size = add_size(size, BufferManagerShmemSize(mapping_sizes));
 	size = add_size(size, LockManagerShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
@@ -144,8 +157,22 @@ CalculateShmemSize(void)
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
 
+	/*
+	 * All the shared memory allocations considered so far happen in the main
+	 * shared memory segment.
+	 */
+	mapping_sizes[MAIN_SHMEM_SEGMENT].shmem_req_size = size;
+	mapping_sizes[MAIN_SHMEM_SEGMENT].shmem_reserved = size;
+
+	size = 0;
 	/* might as well round it off to a multiple of a typical page size */
-	size = add_size(size, 8192 - (size % 8192));
+	for (int segment = 0; segment < NUM_MEMORY_MAPPINGS; segment++)
+	{
+		mapping_sizes[segment].shmem_req_size = add_size(mapping_sizes[segment].shmem_req_size, 8192 - (mapping_sizes[segment].shmem_req_size % 8192));
+		mapping_sizes[segment].shmem_reserved = add_size(mapping_sizes[segment].shmem_reserved, 8192 - (mapping_sizes[segment].shmem_reserved % 8192));
+		/* Compute the total size of all segments */
+		size = size + mapping_sizes[segment].shmem_req_size;
+	}
 
 	return size;
 }
@@ -191,32 +218,44 @@ CreateSharedMemoryAndSemaphores(void)
 {
 	PGShmemHeader *shim;
 	PGShmemHeader *seghdr;
-	Size		size;
+	MemoryMappingSizes mapping_sizes[NUM_MEMORY_MAPPINGS];
 
 	Assert(!IsUnderPostmaster);
 
-	/* Compute the size of the shared-memory block */
-	size = CalculateShmemSize();
-	elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+	CalculateShmemSize(mapping_sizes);
 
-	/*
-	 * Create the shmem segment
-	 */
-	seghdr = PGSharedMemoryCreate(size, &shim);
-
-	/*
-	 * Make sure that huge pages are never reported as "unknown" while the
-	 * server is running.
-	 */
-	Assert(strcmp("unknown",
-				  GetConfigOption("huge_pages_status", false, false)) != 0);
-
-	InitShmemAccess(seghdr);
+	/* Decide if we use huge pages or regular size pages */
+	PrepareHugePages();
 
-	/*
-	 * Set up shared memory allocation mechanism
-	 */
-	InitShmemAllocation();
+	for(int segment = 0; segment < NUM_MEMORY_MAPPINGS; segment++)
+	{
+		MemoryMappingSizes *mapping = &mapping_sizes[segment];
+
+		/* Compute the size of the shared-memory block */
+		elog(DEBUG3, "invoking IpcMemoryCreate(segment %s, size=%zu, reserved address space=%zu)",
+			 MappingName(segment), mapping->shmem_req_size, mapping->shmem_reserved);
+
+		/*
+		 * Create the shmem segment.
+		 *
+		 * XXX: Are multiple shims needed, one per segment?
+		 */
+		seghdr = PGSharedMemoryCreate(mapping, segment, &shim);
+
+		/*
+		 * Make sure that huge pages are never reported as "unknown" while the
+		 * server is running.
+		 */
+		Assert(strcmp("unknown",
+					  GetConfigOption("huge_pages_status", false, false)) != 0);
+
+		InitShmemAccessInSegment(seghdr, segment);
+
+		/*
+		 * Set up shared memory allocation mechanism
+		 */
+		InitShmemAllocationInSegment(segment);
+	}
 
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
@@ -334,7 +373,9 @@ CreateOrAttachShmemStructs(void)
  * InitializeShmemGUCs
  *
  * This function initializes runtime-computed GUCs related to the amount of
- * shared memory required for the current configuration.
+ * shared memory required for the current configuration. It recalculates the
+ * per-segment memory requirements by calling CalculateShmemSize() with a
+ * local mapping_sizes array.
  */
 void
 InitializeShmemGUCs(void)
@@ -343,11 +384,13 @@ InitializeShmemGUCs(void)
 	Size		size_b;
 	Size		size_mb;
 	Size		hp_size;
+	MemoryMappingSizes mapping_sizes[NUM_MEMORY_MAPPINGS];
+
 
 	/*
 	 * Calculate the shared memory size and round up to the nearest megabyte.
 	 */
-	size_b = CalculateShmemSize();
+	size_b = CalculateShmemSize(mapping_sizes);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
 	SetConfigOption("shared_memory_size", buf,
@@ -356,7 +399,7 @@ InitializeShmemGUCs(void)
 	/*
 	 * Calculate the number of huge pages required.
 	 */
-	GetHugePageSize(&hp_size, NULL);
+	GetHugePageSize(&hp_size, NULL, NULL);
 	if (hp_size != 0)
 	{
 		Size		hp_required;
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 0f18beb6ad4..f303a9328df 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -76,20 +76,19 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
-static void *ShmemAllocUnlocked(Size size);
+static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
+								 int shmem_segment);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
+ShmemSegment Segments[NUM_MEMORY_MAPPINGS];
 
-static void *ShmemBase;			/* start address of shared memory */
-
-static void *ShmemEnd;			/* end+1 address of shared memory */
-
-slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
-								 * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/*
+ * Primary index hashtable for shmem; for simplicity we use a single one for
+ * all shared memory segments. There may be performance consequences of that,
+ * and an alternative would be to have one index per shared memory segment.
+ */
+static HTAB *ShmemIndex = NULL;
 
 /* To get reliable results for NUMA inquiry we need to "touch pages" once */
 static bool firstNumaTouch = true;
@@ -102,9 +101,17 @@ Datum		pg_numa_available(PG_FUNCTION_ARGS);
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-	ShmemSegHdr = seghdr;
-	ShmemBase = seghdr;
-	ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+	InitShmemAccessInSegment(seghdr, MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAccessInSegment(PGShmemHeader *seghdr, int shmem_segment)
+{
+	PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+	ShmemSegment *seg = &Segments[shmem_segment];
+	seg->ShmemSegHdr = shmhdr;
+	seg->ShmemBase = (void *) shmhdr;
+	seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -115,7 +122,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-	PGShmemHeader *shmhdr = ShmemSegHdr;
+	InitShmemAllocationInSegment(MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAllocationInSegment(int shmem_segment)
+{
+	PGShmemHeader *shmhdr = Segments[shmem_segment].ShmemSegHdr;
 	char	   *aligned;
 
 	Assert(shmhdr != NULL);
@@ -124,9 +137,9 @@ InitShmemAllocation(void)
 	 * Initialize the spinlock used by ShmemAlloc.  We must use
 	 * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
 	 */
-	ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+	Segments[shmem_segment].ShmemLock = (slock_t *) ShmemAllocUnlockedInSegment(sizeof(slock_t), shmem_segment);
 
-	SpinLockInit(ShmemLock);
+	SpinLockInit(Segments[shmem_segment].ShmemLock);
 
 	/*
 	 * Allocations after this point should go through ShmemAlloc, which
@@ -151,16 +164,22 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+	return ShmemAllocInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocInSegment(Size size, int shmem_segment)
 {
 	void	   *newSpace;
 	Size		allocated_size;
 
-	newSpace = ShmemAllocRaw(size, &allocated_size);
+	newSpace = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 	if (!newSpace)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("out of shared memory (%zu bytes requested)",
-						size)));
+				 errmsg("out of shared memory in segment %s (%zu bytes requested)",
+					MappingName(shmem_segment), size)));
 	return newSpace;
 }
 
@@ -185,6 +204,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+	return ShmemAllocRawInSegment(size, allocated_size, MAIN_SHMEM_SEGMENT);
+}
+
+static void *
+ShmemAllocRawInSegment(Size size, Size *allocated_size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -204,22 +229,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 	size = CACHELINEALIGN(size);
 	*allocated_size = size;
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[shmem_segment].ShmemLock);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree <= ShmemSegHdr->totalsize)
+	if (newFree <= Segments[shmem_segment].ShmemSegHdr->totalsize)
 	{
-		newSpace = (char *) ShmemBase + newStart;
-		ShmemSegHdr->freeoffset = newFree;
+		newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
+		Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 	}
 	else
 		newSpace = NULL;
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[shmem_segment].ShmemLock);
 
 	/* note this assert is okay with newSpace == NULL */
 	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -228,15 +253,16 @@ ShmemAllocRaw(Size size, Size *allocated_size)
 }
 
 /*
- * ShmemAllocUnlocked -- allocate max-aligned chunk from shared memory
+ * ShmemAllocUnlockedInSegment
+ * 		allocate max-aligned chunk from given shared memory segment
  *
  * Allocate space without locking ShmemLock.  This should be used for,
  * and only for, allocations that must happen before ShmemLock is ready.
  *
  * We consider maxalign, rather than cachealign, sufficient here.
  */
-static void *
-ShmemAllocUnlocked(Size size)
+void *
+ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 {
 	Size		newStart;
 	Size		newFree;
@@ -247,19 +273,19 @@ ShmemAllocUnlocked(Size size)
 	 */
 	size = MAXALIGN(size);
 
-	Assert(ShmemSegHdr != NULL);
+	Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-	newStart = ShmemSegHdr->freeoffset;
+	newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
 	newFree = newStart + size;
-	if (newFree > ShmemSegHdr->totalsize)
+	if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("out of shared memory (%zu bytes requested)",
-						size)));
-	ShmemSegHdr->freeoffset = newFree;
+				 errmsg("out of shared memory in segment %s (%zu bytes requested)",
+						MappingName(shmem_segment), size)));
+	Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
-	newSpace = (char *) ShmemBase + newStart;
+	newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
 
 	Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -274,7 +300,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-	return (addr >= ShmemBase) && (addr < ShmemEnd);
+	return ShmemAddrIsValidInSegment(addr, MAIN_SHMEM_SEGMENT);
+}
+
+bool
+ShmemAddrIsValidInSegment(const void *addr, int shmem_segment)
+{
+	return (addr >= Segments[shmem_segment].ShmemBase) && (addr < Segments[shmem_segment].ShmemEnd);
 }
 
 /*
@@ -335,6 +367,18 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 			  int64 max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
+{
+	return ShmemInitHashInSegment(name, init_size, max_size, infoP, hash_flags,
+							   MAIN_SHMEM_SEGMENT);
+}
+
+HTAB *
+ShmemInitHashInSegment(const char *name,		/* table string name for shmem index */
+			  int64 init_size,		/* initial table size */
+			  int64 max_size,		/* max size of the table */
+			  HASHCTL *infoP,		/* info about key and bucket size */
+			  int hash_flags,		/* info about infoP */
+			  int shmem_segment) 	/* in which segment to keep the table */
 {
 	bool		found;
 	void	   *location;
@@ -351,9 +395,9 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
 	/* look it up in the shmem index */
-	location = ShmemInitStruct(name,
+	location = ShmemInitStructInSegment(name,
 							   hash_get_shared_size(infoP, hash_flags),
-							   &found);
+							   &found, shmem_segment);
 
 	/*
 	 * if it already exists, attach to it rather than allocate and initialize
@@ -386,6 +430,13 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+	return ShmemInitStructInSegment(name, size, foundPtr, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
+					  int shmem_segment)
 {
 	ShmemIndexEnt *result;
 	void	   *structPtr;
@@ -394,7 +445,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
 	if (!ShmemIndex)
 	{
-		PGShmemHeader *shmemseghdr = ShmemSegHdr;
+		PGShmemHeader *shmemseghdr = Segments[shmem_segment].ShmemSegHdr;
 
 		/* Must be trying to create/attach to ShmemIndex itself */
 		Assert(strcmp(name, "ShmemIndex") == 0);
@@ -417,7 +468,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 			 * process can be accessing shared memory yet.
 			 */
 			Assert(shmemseghdr->index == NULL);
-			structPtr = ShmemAlloc(size);
+			structPtr = ShmemAllocInSegment(size, shmem_segment);
 			shmemseghdr->index = structPtr;
 			*foundPtr = false;
 		}
@@ -434,8 +485,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		LWLockRelease(ShmemIndexLock);
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("could not create ShmemIndex entry for data structure \"%s\"",
-						name)));
+				 errmsg("could not create ShmemIndex entry for data structure \"%s\" in segment %d",
+						name, shmem_segment)));
 	}
 
 	if (*foundPtr)
@@ -460,7 +511,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		Size		allocated_size;
 
 		/* It isn't in the table yet. allocate and initialize it */
-		structPtr = ShmemAllocRaw(size, &allocated_size);
+		structPtr = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
 		if (structPtr == NULL)
 		{
 			/* out of memory; remove the failed ShmemIndex entry */
@@ -475,18 +526,18 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 		result->size = size;
 		result->allocated_size = allocated_size;
 		result->location = structPtr;
+		result->shmem_segment = shmem_segment;
 	}
 
 	LWLockRelease(ShmemIndexLock);
 
-	Assert(ShmemAddrIsValid(structPtr));
+	Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
 
 	Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
 	return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -527,13 +578,14 @@ mul_size(Size s1, Size s2)
 Datum
 pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 {
-#define PG_GET_SHMEM_SIZES_COLS 4
+#define PG_GET_SHMEM_SIZES_COLS 5
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	HASH_SEQ_STATUS hstat;
 	ShmemIndexEnt *ent;
-	Size		named_allocated = 0;
+	Size		named_allocated[NUM_MEMORY_MAPPINGS] = {0};
 	Datum		values[PG_GET_SHMEM_SIZES_COLS];
 	bool		nulls[PG_GET_SHMEM_SIZES_COLS];
+	int			i;
 
 	InitMaterializedSRF(fcinfo, 0);
 
@@ -546,29 +598,40 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		values[0] = CStringGetTextDatum(ent->key);
-		values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
-		values[2] = Int64GetDatum(ent->size);
-		values[3] = Int64GetDatum(ent->allocated_size);
-		named_allocated += ent->allocated_size;
+		values[1] = CStringGetTextDatum(MappingName(ent->shmem_segment));
+		values[2] = Int64GetDatum((char *) ent->location - (char *) Segments[ent->shmem_segment].ShmemSegHdr);
+		values[3] = Int64GetDatum(ent->size);
+		values[4] = Int64GetDatum(ent->allocated_size);
+		named_allocated[ent->shmem_segment] += ent->allocated_size;
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
 							 values, nulls);
 	}
 
 	/* output shared memory allocated but not counted via the shmem index */
-	values[0] = CStringGetTextDatum("<anonymous>");
-	nulls[1] = true;
-	values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
-	values[3] = values[2];
-	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	for (i = 0; i < NUM_MEMORY_MAPPINGS; i++)
+	{
+		values[0] = CStringGetTextDatum("<anonymous>");
+		values[1] = CStringGetTextDatum(MappingName(i));
+		nulls[2] = true;
+		values[3] = Int64GetDatum(Segments[i].ShmemSegHdr->freeoffset - named_allocated[i]);
+		values[4] = values[3];
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
 
 	/* output as-of-yet unused shared memory */
-	nulls[0] = true;
-	values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
-	nulls[1] = false;
-	values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
-	values[3] = values[2];
-	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	memset(nulls, 0, sizeof(nulls));
+
+	for (i = 0; i < NUM_MEMORY_MAPPINGS; i++)
+	{
+		PGShmemHeader *shmhdr = Segments[i].ShmemSegHdr;
+		nulls[0] = true;
+		values[1] = CStringGetTextDatum(MappingName(i));
+		values[2] = Int64GetDatum(shmhdr->freeoffset);
+		values[3] = Int64GetDatum(shmhdr->totalsize - shmhdr->freeoffset);
+		values[4] = values[3];
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
 
 	LWLockRelease(ShmemIndexLock);
 
@@ -593,7 +656,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 	Size		os_page_size;
 	void	  **page_ptrs;
 	int		   *pages_status;
-	uint64		shm_total_page_count,
+	uint64		shm_total_page_count = 0,
 				shm_ent_page_count,
 				max_nodes;
 	Size	   *nodes;
@@ -628,7 +691,12 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 	 * this is not very likely, and moreover we have more entries, each of
 	 * them using only fraction of the total pages.
 	 */
-	shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
+	for(int segment = 0; segment < NUM_MEMORY_MAPPINGS; segment++)
+	{
+		PGShmemHeader *shmhdr = Segments[segment].ShmemSegHdr;
+		shm_total_page_count += (shmhdr->totalsize / os_page_size) + 1;
+	}
+
 	page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
 	pages_status = palloc(sizeof(int) * shm_total_page_count);
 
@@ -751,7 +819,7 @@ pg_get_shmem_pagesize(void)
 	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
 
 	if (huge_pages_status == HUGE_PAGES_ON)
-		GetHugePageSize(&os_page_size, NULL);
+		GetHugePageSize(&os_page_size, NULL, NULL);
 
 	return os_page_size;
 }
@@ -761,3 +829,46 @@ pg_numa_available(PG_FUNCTION_ARGS)
 {
 	PG_RETURN_BOOL(pg_numa_init() != -1);
 }
+
+/* SQL SRF showing shared memory segments */
+Datum
+pg_get_shmem_segments(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_SEGS_COLS 6
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum		values[PG_GET_SHMEM_SEGS_COLS];
+	bool		nulls[PG_GET_SHMEM_SEGS_COLS];
+	int i;
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	/* output all allocated entries */
+	for (i = 0; i < NUM_MEMORY_MAPPINGS; i++)
+	{
+		ShmemSegment *segment = &Segments[i];
+		PGShmemHeader *shmhdr = segment->ShmemSegHdr;
+		int j;
+
+		if (shmhdr == NULL)
+		{
+			for (j = 0; j < PG_GET_SHMEM_SEGS_COLS; j++)
+				nulls[j] = true;
+		}
+		else
+		{
+			memset(nulls, 0, sizeof(nulls));
+			values[0] = Int32GetDatum(i);
+			values[1] = CStringGetTextDatum(MappingName(i));
+			values[2] = Int64GetDatum(shmhdr->totalsize);
+			values[3] = Int64GetDatum(shmhdr->freeoffset);
+			values[4] = Int64GetDatum(segment->shmem_size);
+			values[5] = Int64GetDatum(segment->shmem_reserved);
+		}
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+
+	return (Datum) 0;
+}
+
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index b017880f5e4..c25dd13b63a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -80,6 +80,8 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
+#include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/procnumber.h"
@@ -612,12 +614,15 @@ LWLockNewTrancheId(const char *name)
 	/*
 	 * We use the ShmemLock spinlock to protect LWLockCounter and
 	 * LWLockTrancheNames.
+	 *
+	 * XXX: This looks like the only use of Segments outside of shmem.c; it
+	 * may be worth reshaping this part to hide the Segments structure.
 	 */
-	SpinLockAcquire(ShmemLock);
+	SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	if (*LWLockCounter - LWTRANCHE_FIRST_USER_DEFINED >= MAX_NAMED_TRANCHES)
 	{
-		SpinLockRelease(ShmemLock);
+		SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 		ereport(ERROR,
 				(errmsg("maximum number of tranches already registered"),
 				 errdetail("No more than %d tranches may be registered.",
@@ -628,7 +633,7 @@ LWLockNewTrancheId(const char *name)
 	LocalLWLockCounter = *LWLockCounter;
 	strlcpy(LWLockTrancheNames[result - LWTRANCHE_FIRST_USER_DEFINED], name, NAMEDATALEN);
 
-	SpinLockRelease(ShmemLock);
+	SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 	return result;
 }
@@ -750,9 +755,9 @@ GetLWTrancheName(uint16 trancheId)
 	 */
 	if (trancheId >= LocalLWLockCounter)
 	{
-		SpinLockAcquire(ShmemLock);
+		SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 		LocalLWLockCounter = *LWLockCounter;
-		SpinLockRelease(ShmemLock);
+		SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
 		if (trancheId >= LocalLWLockCounter)
 			elog(ERROR, "tranche %d is not registered", trancheId);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5cf9e12fcb9..411043ca750 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8576,8 +8576,8 @@
 { oid => '5052', descr => 'allocations from the main shared memory segment',
   proname => 'pg_get_shmem_allocations', prorows => '50', proretset => 't',
   provolatile => 'v', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,int8,int8,int8}', proargmodes => '{o,o,o,o}',
-  proargnames => '{name,off,size,allocated_size}',
+  proallargtypes => '{text,text,int8,int8,int8}', proargmodes => '{o,o,o,o,o}',
+  proargnames => '{name,segment,off,size,allocated_size}',
   prosrc => 'pg_get_shmem_allocations' },
 
 { oid => '4099', descr => 'Is NUMA support available?',
@@ -8600,6 +8600,14 @@
   proargmodes => '{o,o,o}', proargnames => '{name,type,size}',
   prosrc => 'pg_get_dsm_registry_allocations' },
 
+# shared memory segments 
+{ oid => '5101', descr => 'shared memory segments',
+  proname => 'pg_get_shmem_segments', prorows => '6', proretset => 't',
+  provolatile => 'v', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{int4,text,int8,int8,int8,int8}', proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{id,name,size,freeoffset,mapping_size,mapping_reserved_size}',
+  prosrc => 'pg_get_shmem_segments' },
+
 # memory context of local backend
 { oid => '2282',
   descr => 'information about all memory contexts of local backend',
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index ef9800732d9..40588ff6968 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b5f8f3c5d42..3769f4db7dc 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -19,6 +19,7 @@
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
+#include "storage/pg_shmem.h"
 #include "storage/relfilelocator.h"
 #include "utils/relcache.h"
 #include "utils/snapmgr.h"
@@ -326,7 +327,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(MemoryMappingSizes *mapping_sizes);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index 2a8a8f0eabd..d73f1b407db 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -18,6 +18,8 @@
 #ifndef IPC_H
 #define IPC_H
 
+#include "storage/pg_shmem.h"
+
 typedef void (*pg_on_exit_callback) (int code, Datum arg);
 typedef void (*shmem_startup_hook_type) (void);
 
@@ -77,7 +79,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(void);
+extern Size CalculateShmemSize(MemoryMappingSizes *mapping_sizes);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 5f7d4b83a60..beee0a53d2d 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,13 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
+
+typedef struct MemoryMappingSizes
+{
+	Size shmem_req_size;		/* Required size of the segment */
+	Size shmem_reserved; 		/* Required size of the reserved address space. */
+} MemoryMappingSizes;
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,6 +48,27 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+	PGShmemHeader *ShmemSegHdr; 	/* shared mem segment header */
+	void *ShmemBase; 				/* start address of shared memory */
+	void *ShmemEnd; 				/* end+1 address of shared memory */
+	slock_t    *ShmemLock; 			/* spinlock for shared memory and LWLock
+									 * allocation */
+	int segment_fd; 			/* fd for the backing anon file */
+	unsigned long seg_id; 		/* IPC key */
+	int shmem_segment;			/* TODO: Do we really need it? */
+	Size shmem_size; 			/* Size of the actually used memory */
+	Size shmem_reserved; 		/* Size of the reserved mapping */
+	Pointer shmem; 				/* Pointer to the start of the mapped memory */
+	Pointer seg_addr; 			/* SysV shared memory for the header */
+} ShmemSegment;
+
+/* Number of available segments for anonymous memory mappings */
+#define NUM_MEMORY_MAPPINGS 6
+
+extern PGDLLIMPORT ShmemSegment Segments[NUM_MEMORY_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -85,10 +113,38 @@ extern void PGSharedMemoryReAttach(void);
 extern void PGSharedMemoryNoReAttach(void);
 #endif
 
-extern PGShmemHeader *PGSharedMemoryCreate(Size size,
+extern PGShmemHeader *PGSharedMemoryCreate(MemoryMappingSizes *mapping_sizes, int segment_id,
 										   PGShmemHeader **shim);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
-extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
+extern const char *MappingName(int shmem_segment);
+extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
+							int *memfd_flags);
+extern void PrepareHugePages(void);
+
+/*
+ * To be able to dynamically resize the largest parts of the data stored in
+ * shared memory, we split it into multiple shared memory mapping segments.
+ * Each segment contains only a certain part of the data, whose size depends
+ * on NBuffers.
+ */
+
+/* The main segment, contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SEGMENT 0
+
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SEGMENT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SEGMENT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SEGMENT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SEGMENT 5
 
 #endif							/* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 70a5b8b172c..c56712555f0 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -30,14 +30,25 @@ extern PGDLLIMPORT slock_t *ShmemLock;
 typedef struct PGShmemHeader PGShmemHeader; /* avoid including
 											 * storage/pg_shmem.h here */
 extern void InitShmemAccess(PGShmemHeader *seghdr);
+extern void InitShmemAccessInSegment(struct PGShmemHeader *seghdr,
+									 int shmem_segment);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSegment(int shmem_segment);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSegment(Size size, int shmem_segment);
 extern void *ShmemAllocNoError(Size size);
+extern void *ShmemAllocUnlockedInSegment(Size size, int shmem_segment);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSegment(const void *addr, int shmem_segment);
 extern void InitShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, int64 init_size, int64 max_size,
 						   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
+									long max_size, HASHCTL *infoP,
+									int hash_flags, int shmem_segment);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSegment(const char *name, Size size,
+									  bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 
@@ -59,6 +70,7 @@ typedef struct
 	void	   *location;		/* location in shared mem */
 	Size		size;			/* # bytes requested for the structure */
 	Size		allocated_size; /* # bytes actually allocated */
+	int			shmem_segment;	/* segment in which the structure is allocated */
 } ShmemIndexEnt;
 
 #endif							/* SHMEM_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7c52181cbcb..bd877df5f3b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1765,14 +1765,22 @@ pg_shadow| SELECT pg_authid.rolname AS usename,
      LEFT JOIN pg_db_role_setting s ON (((pg_authid.oid = s.setrole) AND (s.setdatabase = (0)::oid))))
   WHERE pg_authid.rolcanlogin;
 pg_shmem_allocations| SELECT name,
+    segment,
     off,
     size,
     allocated_size
-   FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+   FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, segment, off, size, allocated_size);
 pg_shmem_allocations_numa| SELECT name,
     numa_node,
     size
    FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, numa_node, size);
+pg_shmem_segments| SELECT id,
+    name,
+    size,
+    freeoffset,
+    mapping_size,
+    mapping_reserved_size
+   FROM pg_get_shmem_segments() pg_get_shmem_segments(id, name, size, freeoffset, mapping_size, mapping_reserved_size);
 pg_stat_activity| SELECT s.datid,
     d.datname,
     s.pid,
-- 
2.34.1
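
To make it easier to see what the patch exposes on the SQL level, here is the
kind of query the new pg_shmem_segments view (added in pg_proc.dat and visible
in the rules.out changes above) is meant to serve; the actual numbers depend on
shared_buffers, so they are not shown here:

-- one row per mapping, NUM_MEMORY_MAPPINGS (i.e. 6) in total
=# SELECT id, name, size, freeoffset, mapping_size, mapping_reserved_size
     FROM pg_shmem_segments;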

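Similarly, since pg_shmem_allocations now carries the segment a structure was
allocated in, allocations can be aggregated per mapping (just an illustrative
query, not part of the patch):

=# SELECT segment, count(*) AS structures, sum(allocated_size) AS allocated
     FROM pg_shmem_allocations
    GROUP BY segment;
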
Attachment: mfdtruncate.c (text/x-csrc, charset US-ASCII)