Better shared data structure management and resizable shared data structures
Hi Heikki,
As discussed in [1], I am starting a new thread to discuss $Subject.
0001 in the attached patchset is the same as the patch shared in [1]. For
completeness, I am copy-pasting from Heikki's email the description of
what this patch does.
** quote **
Attached is a proof-of-concept of what I have in mind. Don't look too
closely at how it's implemented, it's very hacky and EXEC_BACKEND mode
is slightly broken, for example. The point is to demonstrate what the
callers would look like. I converted only a few subsystems to use the
new API, the rest still use ShmemInitStruct() and ShmemInitHash().
With this, initialization of a subsystem that defines a shared memory
area looks like this:
--------------
/* This struct lives in shared memory */
typedef struct
{
	int		field;
} FoobarSharedCtlData;

static void FoobarShmemInit(void *arg);

/* Descriptor for the shared memory area */
ShmemStructDesc FoobarShmemDesc = {
	.name = "Foobar subsystem",
	.size = sizeof(FoobarSharedCtlData),
	.init_fn = FoobarShmemInit,
};

/* Pointer to the shared memory struct */
#define FoobarCtl ((FoobarSharedCtlData *) FoobarShmemDesc.ptr)

/*
 * Register the shared memory struct. This is called once at
 * postmaster startup, before the shared memory segment is allocated,
 * and in EXEC_BACKEND mode also early at backend startup.
 *
 * For core subsystems, there's a list of all these functions in core
 * in ipci.c, similar to all the *ShmemSize() and *ShmemInit() functions
 * today. In an extension, this would be done in _PG_init() or in
 * the shmem_request_hook, replacing the RequestAddinShmemSpace calls
 * we have today.
 */
void
FoobarShmemRegister(void)
{
	ShmemRegisterStruct(&FoobarShmemDesc);
}

/*
 * This callback is called once at postmaster startup, to initialize
 * the shared memory struct. FoobarShmemDesc.ptr has already been
 * set when this is called.
 */
static void
FoobarShmemInit(void *arg)
{
	memset(FoobarCtl, 0, sizeof(FoobarSharedCtlData));
	FoobarCtl->field = 123;
}
--------------
The ShmemStructDesc provides room for extending the facility in the
future. For example, you could specify alignment there, or an additional
"attach" callback when you need to do more per-backend initialization in
EXEC_BACKEND mode. And with the resizeable shared memory, a max size.
** unquote **
0002 allows the global variables pointing to the shared memory
structures to be specified in ShmemStructDesc for easier use.
This should be merged into 0001.
0003 allows resizable shared memory structures to be specified via
ShmemRegisterStruct() and then implements allocating shared memory
segments for them and allocating the structures themselves. It also
implements the ShmemResizeRegistered() API to resize registered
resizable structures. The resizable shared memory structures are
placed in their own shared memory segments, which are implemented using
the same method as the 0002 patch in [2]. It is also a PoC: do not
look too closely. The pieces dealing with huge pages need some
rework. Portability is another issue. Most important is what method
should be used to implement resizable shared memory itself. More on
that later.
0003 adds APIs to register, allocate and resize shared memory
structures in shmem.c, extending the infrastructure added by 0001. The
patch also has a test which demonstrates how to use those APIs. If we
think those APIs look good, we can work on finishing 0001 and then I
can work on completing 0003.
Thoughts?
I am copying the discussion about supporting resizable shared memory
from shared buffers resizing thread here, since those apply to 0003.
Andres is suggesting an alternate approach [3] to support resizable
shared memory. I am continuing that conversation here.
I think the multiple memory mappings approach is just too restrictive. If we
e.g. eventually want to make some of the other major allocations that depend
on NBuffers react to resizing shared buffers, it's very easy to do if all it
requires is calling
madvise(TYPEALIGN(start, page_size), MADV_REMOVE, TYPEALIGN_DOWN(end, page_size));
You mean madvise(TYPEALIGN(start, page_size), TYPEALIGN_DOWN(end,
page_size) - TYPEALIGN(start, page_size), MADV_REMOVE)? Right?
`man madvise` has this
MADV_REMOVE (since Linux 2.6.16)
Free up a given range of pages and its associated
backing store. This is equivalent to punching a
hole in the corresponding byte range of the backing
store (see fallocate(2)). Subsequent accesses
in the specified address range will see bytes containing zero.
The specified address range must be mapped shared
and writable. This flag cannot be applied to
locked pages, Huge TLB pages, or VM_PFNMAP pages.
In the initial implementation, only tmpfs(5) was
supported MADV_REMOVE; but since Linux 3.5, any
filesystem which supports the fallocate(2)
FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
Hugetlbfs fails with the error EINVAL and other
filesystems fail with the error EOPNOTSUPP.
It says the flag cannot be applied to Huge TLB pages. We won't be
able to make resizable shared memory structures allocated with huge
pages. That seems like a serious restriction.
I may be misunderstanding something, but it seems like this is useful
to free already allocated memory, not necessarily allocate more
memory. I don't understand how a user would start with a larger
reserved address space with only small portions of that space being
backed by memory.
There are several cases that are pretty easy to handle that way:
- Buffer Blocks
- Buffer Descriptors
- Sync request queue (part of the "Checkpointer Data" allocation)
- Checkpoint BufferIds (for sorting the to-be-checkpointed data)
- Buffer IO Condition Variables

But if you want to support making these resizable with the separate mappings
approach, it gets considerably more complicated and the number of mappings
increases more substantially.

We'd also need a lot less infrastructure in shmem.c that way. We could
e.g. make ShmemInitStruct() reserve the entire requested size (to avoid OOM
killer issues) and have a ShmemInitStructExt() that allows the caller to choose
whether to reserve. No different segment IDs etc. are needed.
I agree that if we can devise a mechanism to allocate a single mapping
with holes placed around resizable structures, we could use it for
shared memory structures other than the buffer pool as well. However, as
far as I can understand, we will still need the concept of segments
inside shmem.c (not necessarily in pg_shmem.h) to track the
allocations for each of the individual structures, or maybe we could
use the resizable shmem structure itself to track them.
[1]: /messages/by-id/91265854-b3ba-45c6-aa44-7e8dcdd51470@iki.fi
[2]: /messages/by-id/CAExHW5tSw8r06RLAArvf923cO4NGetitPhQ7AO0o7hsKx8jsNw@mail.gmail.com
[3]: /messages/by-id/aY4v1oSmokXNpQMX@alap3.anarazel.de
--
Best Wishes,
Ashutosh Bapat
Attachments:
0002-Get-rid-of-global-shared-memory-pointer-mac-20260213.patch (text/x-patch, +77 -52)
0001-wip-Introduce-a-new-way-of-registering-shar-20260213.patch (text/x-patch, +665 -389)
0003-WIP-Resizable-shared-memory-structures-20260213.patch (text/x-patch, +1657 -387)
On 13/02/2026 13:47, Ashutosh Bapat wrote:
`man madvise` has this
MADV_REMOVE (since Linux 2.6.16)
[...]
It says the flag cannot be applied to Huge TLB pages. We won't be
able to make resizable shared memory structures allocated with huge
pages. That seems like a serious restriction.
Per https://man7.org/linux/man-pages/man2/madvise.2.html:
MADV_REMOVE (since Linux 2.6.16)
...
Support for the Huge TLB filesystem was added in Linux
v4.3.
I may be misunderstanding something, but it seems like this is useful
to free already allocated memory, not necessarily allocate more
memory. I don't understand how a user would start with a larger
reserved address space with only small portions of that space being
backed by memory.
Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call
to reserve address space for the maximum size, and then
madvise(MADV_POPULATE_WRITE) using the initial size. Later,
madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
again.
- Heikki
On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 13/02/2026 13:47, Ashutosh Bapat wrote:
It says the flag cannot be applied to Huge TLB pages. We won't be
able to make resizable shared memory structures allocated with huge
pages. That seems like a serious restriction.

Per https://man7.org/linux/man-pages/man2/madvise.2.html:
MADV_REMOVE (since Linux 2.6.16)
...
Support for the Huge TLB filesystem was added in Linux v4.3.
[...]
Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call
to reserve address space for the maximum size, and then
madvise(MADV_POPULATE_WRITE) using the initial size. Later,
madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
again.
Thank you for the hint. Also thanks to Andres's idea, the resizable
structure patch is quite small now. Actually, after experimenting with
madvise, memfd_create and ftruncate(), I see that MADV_POPULATE_WRITE
is not required at all. We don't have to do anything to expand a
structure. Memory will be allocated as and when the program writes to
it. I also discovered things that I didn't know about.
1. ftruncate() sets the size of the file but it doesn't allocate the
memory pages.
2. to use madvise() the address needs to be backed by a file, so
memfd_create is a must.
3. We can't write to a file backed memory at a location beyond the
size of the file. Hence we have to set the size of the file to the
maximum size at the beginning.
4. the address and length passed to madvise needs to be page aligned,
but that passed to fallocate() needn't be. `man fallocate` says
"Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
range starting at offset and continuing for len bytes. Within the
specified range, partial filesystem blocks are zeroed, and whole
filesystem blocks are removed from the file.". It seems to be
automatically taking care of the page size. So using fallocate()
simplifies logic. Further `man madvise` says "but since Linux 3.5, any
filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
guaranteed to be available on a system which supports MADV_REMOVE.
Using fallocate() (or madvise()) to free memory, we don't need
multiple segments. So much less code churn compared to the multiple
mappings approach. However, there is one drawback. In the multiple
mapping approach access beyond the current size of the structure would
result in segfault or bus error. But in the fallocate/madvise approach
such an access does not cause a crash. A write beyond the pages that
fit the current size of the structure causes more memory to be
allocated silently. A read returns 0s. So, there's a possibility that
bugs in size calculations might go unnoticed. I think that's how it
works even today, access in the yet un-allocated part of the shared
memory will simply go unnoticed.
PFA the patches, with 0003 implementing resizable structures using
fallocate(). There are TODOs, and I also need to make sure that
resizable structures are disabled where memfd_create(), fallocate()
and anonymous memory mappings are not available. Also, the test is
unstable since it prints the memory consumption numbers obtained from
/proc/self/status. But it demonstrates the allocation and freeing of
shared memory as the shared structures undergo resizing. I don't think
there is a stable way to use the numbers though, so we might have to
remove those ultimately.
--
Best Wishes,
Ashutosh Bapat
Attachments:
0001-wip-Introduce-a-new-way-of-registering-shar-20260216.patch (text/x-patch, +665 -389)
0002-Get-rid-of-global-shared-memory-pointer-mac-20260216.patch (text/x-patch, +77 -52)
0003-WIP-resizable-shared-memory-structures-20260216.patch (text/x-patch, +830 -35)
On 16/02/2026 16:52, Ashutosh Bapat wrote:
2. to use madvise() the address needs to be backed by a file, so
memfd_create is a must.
It seems to work fine for anonymous mmapped memory here. See attached
test program.
- Heikki
Attachments:
test_mmap.c (text/x-csrc)
Hi,
On 2026-02-16 20:22:51 +0530, Ashutosh Bapat wrote:
On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 13/02/2026 13:47, Ashutosh Bapat wrote:
[...]

Thank you for the hint. Also thanks to Andres's idea, the resizable
structure patch is quite small now. Actually, after experimenting with
madvise, memfd_create and ftruncate(), I see that MADV_POPULATE_WRITE
is not required at all. We don't have to do anything to expand a
structure. Memory will be allocated as and when the program writes to
it.
I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
because otherwise you'll get a SIGBUS when accessing the memory if there is no
huge page available anymore.
I also discovered things that I didn't know about.
1. ftruncate() sets the size of the file but it doesn't allocate the
memory pages.
Right.
2. to use madvise() the address needs to be backed by a file, so
memfd_create is a must.
I am quite sure that that is not true. I hacked this up with today's
postgres, and the madvise works with the mmap() backed allocation from
sysv_shmem.c, which is anonymous.
What made you conclude that that is the case?
4. the address and length passed to madvise needs to be page aligned,
but that passed to fallocate() needn't be. [...] fallocate with
FALLOC_FL_PUNCH_HOLE is guaranteed to be available on a system which
supports MADV_REMOVE.
I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?
Using fallocate() (or madvise()) to free memory, we don't need
multiple segments. [...] So, there's a possibility that bugs in size
calculations might go unnoticed.
If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.
Greetings,
Andres Freund
On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres@anarazel.de> wrote:
I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
because otherwise you'll get a SIGBUS when accessing the memory if there is no
huge page available anymore.
Ok.
Jakub's experiments [1] showed that fallocate()ing shared memory would
slow down postmaster start on a slow machine. I suppose the same thing
applies to MADV_POPULATE_WRITE. And we don't do that today even in the
case of huge pages; so we already have that problem.
If we perform MADV_POPULATE_WRITE, do we want it only for resizable
shared memory structures or all the structures in the shared memory?
On Mon, Feb 16, 2026 at 11:02 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 16/02/2026 16:52, Ashutosh Bapat wrote:
2. to use madvise() the address needs to be backed by a file, so
memfd_create is a must.

It seems to work fine for anonymous mmapped memory here. See attached
test program.
On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres@anarazel.de> wrote:
2. to use madvise() the address needs to be backed by a file, so
memfd_create is a must.

I am quite sure that that is not true. I hacked this up with today's
postgres, and the madvise works with the mmap() backed allocation from
sysv_shmem.c, which is anonymous.

What made you conclude that that is the case?
You are right. I was misled by the following sentence in `man
madvise`: "but since Linux 3.5, any filesystem which supports the
fallocate(2) FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
Filesystems which do not support MADV_REMOVE fail with the error
EOPNOTSUPP." And in a subsequent experiment I dropped MAP_ANONYMOUS
from mmap() and used madvise(), which obviously didn't work. My bad.
In the attached patches, I have got rid of memfd_create. That simplifies code.
4. the address and length passed to madvise needs to be page aligned,
but that passed to fallocate() needn't be. [...]

I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?
No point really. But we cannot control the extensions, which may want
to specify a maximum size smaller than a page size. They wouldn't know
what page size the underlying machine will have, especially with huge
pages, which have a wide range of sizes. Even in the case of shared
buffers, a value of max_shared_buffers may cause buffer blocks to span
pages while other structures fit in a page.
In the attached patches, if a resizable structure's max_size is
smaller than a page size, it is treated as a fixed structure with
size = max_size. Any request to resize such a structure simply
updates the metadata without an actual madvise operation. Only
structures whose max_size > page_size are treated as truly resizable
and use madvise.

You bring up another interesting point. If a resizable structure has a
maximum size larger than the page size, but it is allocated such that
its initial part is on a partially allocated page and its last part is
on another partially allocated page, those pages can never be freed
because of the adjoining structures. Per the logic in the attached
patches, all the fixed (or pseudo-resizable) structures are packed
together. The resizable structures start on a page boundary and their
max_sizes are adjusted to be page aligned. That way we can release
pages when a structure shrinks by more than a page.
However, there is one drawback. [...] So, there's a possibility that
bugs in size calculations might go unnoticed.

If that's something you care about, you can mprotect(PROT_NONE) the
relevant regions.
I am fine with letting go of this protection while getting rid of
multiple segments, if we all agree to do so.

I could be wrong, but mprotect needs to be executed in every backend
where the memory is mapped, and then a new backend needs to inherit it
from the postmaster. That makes resizing complex, since it has to touch
every backend. So avoiding mprotect is better.
[1]: /messages/by-id/CAKZiRmwxVqEbp7JgOed=BCT6cq8RNuHk3N0vuwro65Tsw9E8NA@mail.gmail.com
PFA patches.
--
Best Wishes,
Ashutosh Bapat
Attachments:
0002-Get-rid-of-global-shared-memory-pointer-mac-20260217.patch (text/x-patch, +77 -52)
0001-wip-Introduce-a-new-way-of-registering-shar-20260217.patch (text/x-patch, +665 -389)
0003-WIP-resizable-shared-memory-structures-20260217.patch (text/x-patch, +874 -23)
On Tue, Feb 17, 2026 at 5:06 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres@anarazel.de> wrote:
I think we *do* want the MADV_POPULATE_WRITE, at least when using huge
pages, because otherwise you'll get a SIGBUS when accessing the memory
if there is no huge page available anymore.

[...] If we perform MADV_POPULATE_WRITE, do we want it only for
resizable shared memory structures or all the structures in the shared
memory?
In the attached patches, I have used MADV_POPULATE_WRITE during
resizing, which is a run-time operation. When the structures are
allocated at server start, they are usually initialised, which
allocates their memory anyway; so we don't need MADV_POPULATE_WRITE at
that time, and thus avoid adding to startup slowness, if any. Buffer
blocks are not initialised at the time of starting the server, so
their memory is allocated as they are accessed. But that's how it
works today, so no change there.
[...]
If the general approach in the attached patches looks good, we can
work on improving the 0001 + 0002 to be committable and then work on
0003.
--
Best Wishes,
Ashutosh Bapat
Attachments:
0001-wip-Introduce-a-new-way-of-registering-shar-20260218.patch (text/x-patch, +665 -389)
0002-Get-rid-of-global-shared-memory-pointer-mac-20260218.patch (text/x-patch, +77 -52)
0003-WIP-resizable-shared-memory-structures-20260218.patch (text/x-patch, +860 -23)
On Wed, Feb 18, 2026 at 9:17 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Tue, Feb 17, 2026 at 5:06 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres@anarazel.de> wrote:
I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
because otherwise you'll get a SIGBUS when accessing the memory if there is no
huge page available anymore.Ok.
Jakub's experiments [1] showed that fallocate()ing shared memory would
slow down postmaster start on a slow machine. I suppose the same thing
applies to MADV_POPULATE_WRITE. And we don't do that today even in the
case of huge pages; so we already have that problem.If we perform MADV_POPULATE_WRITE, do we want it only for resizable
shared memory structures or all the structures in the shared memory?In the attached patches, I have used MADV_POPULATE_WRITE during
resizing, which is run time operation. When the structures are
allocated when server starts, they are usually initialised, so we end
up allocating memory for the same. So we don't need
MADV_POPULATE_WRITE at that time, and thus avoid affecting startup
slowness, if any. Buffer blocks are not initialised at the time of
starting the server, so their memory is allocated as they are
accessed. But that's how it works today, so no change there.4. the address and length passed to madvise needs to be page aligned,
but that passed to fallocate() needn't be. `man fallocate` says
"Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
range starting at offset and continuing for len bytes. Within the
specified range, partial filesystem blocks are zeroed, and whole
filesystem blocks are removed from the file.". It seems to be
automatically taking care of the page size. So using fallocate()
simplifies logic. Further `man madvise` says "but since Linux 3.5, any
filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
guaranteed to be available on a system which supports MADV_REMOVE.

I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?

No point really. But we cannot control the extensions which want to
specify a maximum size smaller than a page size. They wouldn't know
what page size the underlying machine will have, especially with huge
pages which have a wide range of sizes. Even in the case of shared
buffers, a value of max_shared_buffers may cause buffer blocks to span
pages but other structures may fit in a page.

In the attached patches, if a resizable structure is such that its
max_size is smaller than a page size, it is treated as a fixed
structure with size = max_size. Any request to resize such structures
will simply update the metadata without an actual madvise operation.
Only the structures whose max_size > page_size are treated as truly
resizable and will use madvise. You bring up another interesting point.
If a resizable structure has a maximum size higher than the page size,
but it is allocated such that the initial part of it is on a partially
allocated page and the last part of it is on another partially
allocated page, those pages are never freed because of adjoining
structures. Per the logic in the attached patches, all the fixed (or
pseudo-resizable) structures are packed together. The resizable
structures start on a page boundary and their max_sizes are adjusted
to be page aligned. That way we can release pages when the structure
shrinks by more than a page.

Using fallocate() (or madvise()) to free memory, we don't need
multiple segments. So much less code churn compared to the multiple
mappings approach. However, there is one drawback. In the multiple
mapping approach access beyond the current size of the structure would
result in segfault or bus error. But in the fallocate/madvise approach
such an access does not cause a crash. A write beyond the pages that
fit the current size of the structure causes more memory to be
allocated silently. A read returns 0s. So, there's a possibility that
bugs in size calculations might go unnoticed. I think that's how it
works even today, access in the yet un-allocated part of the shared
memory will simply go unnoticed.

If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.

I am fine with letting go of this protection while getting rid of
multiple segments, if we all agree to do so.

I could be wrong, but mprotect needs to be executed in every backend
where the memory is mapped and then a new backend needs to inherit it
from the postmaster. Makes resizing complex since it has to touch
every backend. So avoiding mprotect is better.
Sent too soon.
I have also reworked the test into a TAP test, which looks more stable than
the earlier version. Haven't had any failures on my laptop.
If the general approach in the attached patches looks good, we can
work on improving the 0001 + 0002 to be committable and then work on
0003.
The resizable memory patch works only on Linux, where
MADV_POPULATE_WRITE and MADV_REMOVE are supported on anonymous shared
memory. On other platforms and where that support doesn't exist, we
will need to disable the feature for now. That work remains. Also the
TODOs need to be addressed.
--
Best Wishes,
Ashutosh Bapat
On Wed, Feb 18, 2026 at 9:17 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
4. The address and length passed to madvise() need to be page aligned,
but that passed to fallocate() needn't be. `man fallocate` says
"Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
range starting at offset and continuing for len bytes. Within the
specified range, partial filesystem blocks are zeroed, and whole
filesystem blocks are removed from the file.". It seems to be
automatically taking care of the page size. So using fallocate()
simplifies logic. Further `man madvise` says "but since Linux 3.5, any
filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
guaranteed to be available on a system which supports MADV_REMOVE.

I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?

No point really. But we cannot control the extensions which want to
specify a maximum size smaller than a page size. They wouldn't know
what page size the underlying machine will have, especially with huge
pages which have a wide range of sizes. Even in the case of shared
buffers, a value of max_shared_buffers may cause buffer blocks to span
pages but other structures may fit in a page.

In the attached patches, if a resizable structure is such that its
max_size is smaller than a page size, it is treated as a fixed
structure with size = max_size. Any request to resize such structures
will simply update the metadata without an actual madvise operation.
Only the structures whose max_size > page_size are treated as truly
resizable and will use madvise. You bring up another interesting point.
If a resizable structure has a maximum size higher than the page size,
but it is allocated such that the initial part of it is on a partially
allocated page and the last part of it is on another partially
allocated page, those pages are never freed because of adjoining
structures. Per the logic in the attached patches, all the fixed (or
pseudo-resizable) structures are packed together. The resizable
structures start on a page boundary and their max_sizes are adjusted
to be page aligned. That way we can release pages when the structure
shrinks by more than a page.
It was a mistake on my part to assume that more memory will be freed
if we page align the start and end of a resizable structure. I didn't
account for the memory wasted in the alignment itself. That amount comes
out to be the same as the amount of memory wasted if we don't page align
the structure. But the code is simpler if we don't page align the
structure as seen in the attached patches.
Using fallocate() (or madvise()) to free memory, we don't need
multiple segments. So much less code churn compared to the multiple
mappings approach. However, there is one drawback. In the multiple
mapping approach access beyond the current size of the structure would
result in segfault or bus error. But in the fallocate/madvise approach
such an access does not cause a crash. A write beyond the pages that
fit the current size of the structure causes more memory to be
allocated silently. A read returns 0s. So, there's a possibility that
bugs in size calculations might go unnoticed. I think that's how it
works even today, access in the yet un-allocated part of the shared
memory will simply go unnoticed.

If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.

I am fine with letting go of this protection while getting rid of
multiple segments, if we all agree to do so.

I could be wrong, but mprotect needs to be executed in every backend
where the memory is mapped and then a new backend needs to inherit it
from the postmaster. Makes resizing complex since it has to touch
every backend. So avoiding mprotect is better.
I discussed this point with Andres offlist. Here's a summary of that
discussion. Any serious users of resizable shared memory structures
would need to send proc signal barriers to synchronize the resizing
across the backends. This barrier can be used to perform mprotect() in
the backends and a separate signal to Postmaster, if mprotect is
needed in Postmaster. But whether mprotect is needed depends upon the
use case. It should be the responsibility of the resizable structure's
user, and not of ShmemResizeRegistered().
The following points need a bit of discussion.
1. calculation of allocated_size
For fixed-size shared memory structures, allocated_size is the size
of the structure after cache-aligning it. Assuming that the shared
memory is allocated in pages, this is also the actual memory
allocated to the structure once the whole structure has been written
to. For a resizable structure, it's a bit more complicated. We reserve
the maximum space required by the structure. At a given point in
time, the page where the next structure begins and the page which
contains the current end of the structure are allocated; the pages
in between are not. Thus the allocated_size should be the length from
the start of the structure to the end of the page containing the
current end of the structure, plus the part of the page where the next
structure starts, up to the start of the next structure. That is what
is implemented in the attached patches.
2. GUCs shared_memory_size, shared_memory_size_in_huge_pages
These GUCs indicate the size of the shared memory in bytes and in huge
pages respectively. Without resizable shared memory structures,
calculating these is straightforward: we sum the sizes of all the
requested structures. With resizable shared memory structures, these
GUCs do not make much sense. Since the memory allocated to the
resizable structures can be anywhere between zero and the maximum,
neither the sum of their initial sizes nor the sum of their maximum
sizes can be reported as shared_memory_size. Similarly for
shared_memory_size_in_huge_pages. We need two GUCs to replace each of
the existing GUCs - max_shared_memory_size, initial_shared_memory_size
and their huge-page peers. max_shared_memory_size is the sum of the
maximum sizes of the resizable structures plus the requested sizes of
the fixed structures. initial_shared_memory_size is the sum of the
initial sizes requested for all the structures.
3. Testing the memory allocation
I couldn't find a way to reliably know the shared memory allocated at
a given address in a process. RssShmem gives the amount of shared
memory accessed by the process, which includes memory allocated to
the fixed structures that the process has touched. This value isn't
stable across runs of the test in the patch. The test logs the
reported RssShmem alongside the variations in the resizable shared
memory structure, which can be visually inspected to be within limits.
But those limits are hard to check in the test code. Looking for some
suggestions here.
Disabling resizable structures on builds that do not support them is
still a TODO.
--
Best Wishes,
Ashutosh Bapat
Attachments:
0001-wip-Introduce-a-new-way-of-registering-shar-20260223.patch (text/x-patch, +665 -389)
0002-Get-rid-of-global-shared-memory-pointer-mac-20260223.patch (text/x-patch, +77 -52)
0003-WIP-resizable-shared-memory-structures-20260223.patch (text/x-patch, +789 -30)
I spent some time cleaning up the new registration machinery. I didn't
look at the "resizeable" part yet, but it should fit in nicely, as
Ashutosh demonstrated.
I added a lot of comments, changed the existing docs on how to allocate
shared memory in extensions to use the machinery, and tons of other
cleanups. I'm getting pretty happy with this, but there are a couple of
weak spots:
Firstly, I'm not sure what to do with ShmemRegisterHash() and the
'HASHCTL *infoP' argument to it. I feel it'd be nicer if the HASHCTL was
just part of the ShmemHashDesc struct, but I'm not sure if that fits all
the callers. I'll have to try that out I guess.
Secondly, I'm not 100% happy with the facilities we provide to
extensions. The lifecycle of _PG_init() and shmem_request/startup_hooks
is a little messy. The status quo is that a shared library gets control
in three different places:
1. _PG_init() gets called early at postmaster startup, if the library is
in shared_preload_libraries. If it's not in shared_preload_libraries, it
gets called whenever the module is loaded.
2. The library can install a shmem_request_hook, which gets called early
at postmaster startup, but after initializing the MaxBackends GUC. It
only gets called when the library is loaded via shared_preload_libraries.
3. The library can install a shmem_startup_hook. It gets called later at
postmaster startup, after the shared memory segment has been allocated.
In EXEC_BACKEND mode it also gets called at backend startup. It does not
get called if the library is not listed in shared_preload_libraries.
None of these is quite the right moment to call the new
ShmemRegisterStruct() function. _PG_init() is too early if the extension
needs MaxBackends for sizing the shared memory area. shmem_request_hook
is otherwise good, but in EXEC_BACKEND mode, the ShmemRegisterStruct()
function also needs to be called at backend startup, and
shmem_request_hook is not called there. shmem_startup_hook() is too late.
For now, I documented that an extension should call
ShmemRegisterStruct() from _PG_init(), but may adjust the size in the
shmem_request_hook, if needed.
Another wrinkle here is that you still need the shmem_request_hook, if
you want to call RequestNamedLWLockTranche(). It cannot be called from
_PG_init(). I'm not sure why that is.
So I think that requires a little more refactoring: an extension
shouldn't need to use shmem_request/startup_hook with the new APIs
anymore. We should provide more ergonomic callbacks or other mechanisms
to accomplish the same things.
- Heikki