Better shared data structure management and resizable shared data structures
Hi Heikki,
As discussed in [1], I am starting a new thread to discuss $Subject.
0001 in the attached patchset is the same as the patch shared in [1]. For
completeness, I am copy-pasting from Heikki's email the description of
what this patch does.
** quote **
Attached is a proof-of-concept of what I have in mind. Don't look too
closely at how it's implemented, it's very hacky and EXEC_BACKEND mode
is slightly broken, for example. The point is to demonstrate what the
callers would look like. I converted only a few subsystems to use the
new API, the rest still use ShmemInitStruct() and ShmemInitHash().
With this, initialization of a subsystem that defines a shared memory
area looks like this:
--------------
/* This struct lives in shared memory */
typedef struct
{
	int		field;
} FoobarSharedCtlData;

static void FoobarShmemInit(void *arg);

/* Descriptor for the shared memory area */
ShmemStructDesc FoobarShmemDesc = {
	.name = "Foobar subsystem",
	.size = sizeof(FoobarSharedCtlData),
	.init_fn = FoobarShmemInit,
};

/* Pointer to the shared memory struct */
#define FoobarCtl ((FoobarSharedCtlData *) FoobarShmemDesc.ptr)

/*
 * Register the shared memory struct. This is called once at
 * postmaster startup, before the shared memory segment is allocated,
 * and in EXEC_BACKEND mode also early at backend startup.
 *
 * For core subsystems, there's a list of all these functions in core
 * in ipci.c, similar to all the *ShmemSize() and *ShmemInit() functions
 * today. In an extension, this would be done in _PG_init() or in
 * the shmem_request_hook, replacing the RequestAddinShmemSpace calls
 * we have today.
 */
void
FoobarShmemRegister(void)
{
	ShmemRegisterStruct(&FoobarShmemDesc);
}

/*
 * This callback is called once at postmaster startup, to initialize
 * the shared memory struct. FoobarShmemDesc.ptr has already been
 * set when this is called.
 */
static void
FoobarShmemInit(void *arg)
{
	memset(FoobarCtl, 0, sizeof(FoobarSharedCtlData));
	FoobarCtl->field = 123;
}
--------------
The ShmemStructDesc provides room for extending the facility in the
future. For example, you could specify alignment there, or an additional
"attach" callback when you need to do more per-backend initialization in
EXEC_BACKEND mode. And with the resizeable shared memory, a max size.
** unquote **
0002 allows the global variables pointing to the shared memory
structures to be specified in ShmemStructDesc for easier use.
This should be merged into 0001.
0003 allows resizable shared memory structures to be specified via
ShmemRegisterStruct() and then implements allocating shared memory
segments for them and allocating the structures themselves. It also
implements the ShmemResizeRegistered() API to resize registered
resizable structures. The resizable shared memory structures are
placed in their own shared memory segments, which are implemented using
the same method as the 0002 patch in [2]. It is also a PoC: do not
look too closely. The pieces dealing with huge pages need some
rework. Portability is another issue. Most important is what method
should be used to implement resizable shared memory itself. More on
that later.
0003 adds APIs to register, allocate and resize shared memory
structures in shmem.c, extending the infrastructure added by 0001. The
patch also has a test which demonstrates how to use those APIs. If we
think those APIs look good, we can work on finishing 0001 and then I
can work on completing 0003.
Thoughts?
I am copying the discussion about supporting resizable shared memory
from shared buffers resizing thread here, since those apply to 0003.
Andres is suggesting an alternate approach [3] to support resizable
shared memory. I am continuing that conversation here.
I think the multiple memory mappings approach is just too restrictive. If we
e.g. eventually want to make some of the other major allocations that depend
on NBuffers react to resizing shared buffers, it's very easy to do if all it
requires is calling
madvise(TYPEALIGN(start, page_size), MADV_REMOVE, TYPEALIGN_DOWN(end, page_size));
You mean madvise(TYPEALIGN(start, page_size), TYPEALIGN_DOWN(end,
page_size) - TYPEALIGN(start, page_size), MADV_REMOVE)? Right?
`man madvise` has this
MADV_REMOVE (since Linux 2.6.16)
Free up a given range of pages and its associated
backing store. This is equivalent to punching a
hole in the corresponding byte range of the backing
store (see fallocate(2)). Subsequent accesses
in the specified address range will see bytes containing zero.
The specified address range must be mapped shared
and writable. This flag cannot be applied to
locked pages, Huge TLB pages, or VM_PFNMAP pages.
In the initial implementation, only tmpfs(5) was
supported MADV_REMOVE; but since Linux 3.5, any
filesystem which supports the fallocate(2)
FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
Hugetlbfs fails with the error EINVAL and other
filesystems fail with the error EOPNOTSUPP.
It says the flag cannot be applied to Huge TLB pages. We won't be
able to make resizable shared memory structures allocated with huge
pages. That seems like a serious restriction.
I may be misunderstanding something, but it seems like this is useful
to free already allocated memory, not necessarily allocate more
memory. I don't understand how a user would start with a larger
reserved address space with only small portions of that space being
backed by memory.
There are several cases that are pretty easy to handle that way:
- Buffer Blocks
- Buffer Descriptors
- Sync request queue (part of the "Checkpointer Data" allocation)
- Checkpoint BufferIds (for sorting the to-be-checkpointed data)
- Buffer IO Condition Variables

But if you want to support making these resizable with the separate mappings
approach, it gets considerably more complicated and the number of mappings
increases more substantially.

We'd also need a lot less infrastructure in shmem.c that way. We could
e.g. make ShmemInitStruct() reserve the entire requested size (to avoid OOM
killer issues) and have a ShmemInitStructExt() that allows the caller to choose
whether to reserve. No different segment IDs etc. are needed.
I agree that if we can devise a mechanism to allocate a single mapping
with holes placed around resizable structures, we could use it for
shared memory structures other than the buffer pool as well. However, as
far as I can understand, we will still need the concept of segments
inside shmem.c (not necessarily in pg_shmem.h) to track the
allocations for each of the individual structures, or maybe we could
use the resizable shmem structure itself to track them.
[1]: /messages/by-id/91265854-b3ba-45c6-aa44-7e8dcdd51470@iki.fi
[2]: /messages/by-id/CAExHW5tSw8r06RLAArvf923cO4NGetitPhQ7AO0o7hsKx8jsNw@mail.gmail.com
[3]: /messages/by-id/aY4v1oSmokXNpQMX@alap3.anarazel.de
--
Best Wishes,
Ashutosh Bapat
Attachments:
0002-Get-rid-of-global-shared-memory-pointer-mac-20260213.patch (text/x-patch, +77 -52)
0001-wip-Introduce-a-new-way-of-registering-shar-20260213.patch (text/x-patch, +665 -389)
0003-WIP-Resizable-shared-memory-structures-20260213.patch (text/x-patch, +1657 -387)
On 13/02/2026 13:47, Ashutosh Bapat wrote:
`man madvise` has this
MADV_REMOVE (since Linux 2.6.16)
[...]
It says the flag cannot be applied to Huge TLB pages. We won't be
able to make resizable shared memory structures allocated with huge
pages. That seems like a serious restriction.
Per https://man7.org/linux/man-pages/man2/madvise.2.html:
MADV_REMOVE (since Linux 2.6.16)
...
Support for the Huge TLB filesystem was added in Linux
v4.3.
I may be misunderstanding something, but it seems like this is useful
to free already allocated memory, not necessarily allocate more
memory. I don't understand how a user would start with a larger
reserved address space with only small portions of that space being
backed by memory.
Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call
to reserve address space for the maximum size, and then
madvise(MADV_POPULATE_WRITE) using the initial size. Later,
madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
again.
- Heikki
On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 13/02/2026 13:47, Ashutosh Bapat wrote:
It says the flag cannot be applied to Huge TLB pages. We won't be
able to make resizable shared memory structures allocated with huge
pages. That seems like a serious restriction.

Per https://man7.org/linux/man-pages/man2/madvise.2.html:
MADV_REMOVE (since Linux 2.6.16)
...
Support for the Huge TLB filesystem was added in Linux v4.3.
[...]
Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call
to reserve address space for the maximum size, and then
madvise(MADV_POPULATE_WRITE) using the initial size. Later,
madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
again.
Thank you for the hint. Also thanks to Andres's idea, the resizable
structure patch is quite small now. Actually, after experimenting with
madvise, memfd_create and ftruncate(), I see that MADV_POPULATE_WRITE
is not required at all. We don't have to do anything to expand a
structure. Memory will be allocated as and when the program writes to
it. I also discovered things that I didn't know about.
1. ftruncate() sets the size of the file but it doesn't allocate the
memory pages.
2. to use madvise() the address needs to be backed by a file, so
memfd_create is a must.
3. We can't write to a file backed memory at a location beyond the
size of the file. Hence we have to set the size of the file to the
maximum size at the beginning.
4. the address and length passed to madvise needs to be page aligned,
but that passed to fallocate() needn't be. `man fallocate` says
"Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
range starting at offset and continuing for len bytes. Within the
specified range, partial filesystem blocks are zeroed, and whole
filesystem blocks are removed from the file.". It seems to be
automatically taking care of the page size. So using fallocate()
simplifies logic. Further `man madvise` says "but since Linux 3.5, any
filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
guaranteed to be available on a system which supports MADV_REMOVE.
Using fallocate() (or madvise()) to free memory, we don't need
multiple segments. So much less code churn compared to the multiple
mappings approach. However, there is one drawback. In the multiple
mapping approach access beyond the current size of the structure would
result in segfault or bus error. But in the fallocate/madvise approach
such an access does not cause a crash. A write beyond the pages that
fit the current size of the structure causes more memory to be
allocated silently. A read returns 0s. So, there's a possibility that
bugs in size calculations might go unnoticed. I think that's how it
works even today, access in the yet un-allocated part of the shared
memory will simply go unnoticed.
PFA the patches, with 0003 implementing resizable structures using
fallocate(). There are TODOs, and I also need to make sure that
resizable structures are disabled where memfd_create(), fallocate()
and anonymous memory mappings are not available. Also, the test is
unstable since it prints the memory consumption numbers obtained from
/proc/self/status. But it demonstrates the allocation and freeing of
shared memory as the shared structures undergo resizing. I don't think
there is a stable way to use the numbers though, so we might have to
remove those ultimately.
--
Best Wishes,
Ashutosh Bapat
Attachments:
0001-wip-Introduce-a-new-way-of-registering-shar-20260216.patch (text/x-patch, +665 -389)
0002-Get-rid-of-global-shared-memory-pointer-mac-20260216.patch (text/x-patch, +77 -52)
0003-WIP-resizable-shared-memory-structures-20260216.patch (text/x-patch, +830 -35)
On 16/02/2026 16:52, Ashutosh Bapat wrote:
2. to use madvise() the address needs to be backed by a file, so
memfd_create is a must.
It seems to work fine for anonymous mmapped memory here. See attached
test program.
- Heikki
Attachments:
test_mmap.c (text/x-csrc)
Hi,
On 2026-02-16 20:22:51 +0530, Ashutosh Bapat wrote:
On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 13/02/2026 13:47, Ashutosh Bapat wrote:
[...]

Thank you for the hint. Also thanks to Andres's idea, the resizable
structure patch is quite small now. Actually, after experimenting with
madvise, memfd_create and ftruncate(), I see that MADV_POPULATE_WRITE
is not required at all. We don't have to do anything to expand a
structure. Memory will be allocated as and when the program writes to
it.
I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
because otherwise you'll get a SIGBUS when accessing the memory if there is no
huge page available anymore.
I also discovered things that I didn't know about.
1. ftruncate() sets the size of the file but it doesn't allocate the
memory pages.
Right.
2. to use madvise() the address needs to be backed by a file, so
memfd_create is a must.
I am quite sure that that is not true. I hacked this up with today's
postgres, and the madvise works with the mmap() backed allocation from
sysv_shmem.c, which is anonymous.
What made you conclude that that is the case?
4. the address and length passed to madvise needs to be page aligned,
but that passed to fallocate() needn't be. [...] fallocate with
FALLOC_FL_PUNCH_HOLE is guaranteed to be available on a system which
supports MADV_REMOVE.
I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?
Using fallocate() (or madvise()) to free memory, we don't need
multiple segments. [...] So, there's a possibility that bugs in size
calculations might go unnoticed.
If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.
Greetings,
Andres Freund
On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres@anarazel.de> wrote:
I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
because otherwise you'll get a SIGBUS when accessing the memory if there is no
huge page available anymore.
Ok.
Jakub's experiments [1] showed that fallocate()ing shared memory would
slow down postmaster start on a slow machine. I suppose the same thing
applies to MADV_POPULATE_WRITE. And we don't do that today even in the
case of huge pages; so we already have that problem.
If we perform MADV_POPULATE_WRITE, do we want it only for resizable
shared memory structures or all the structures in the shared memory?
On Mon, Feb 16, 2026 at 11:02 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 16/02/2026 16:52, Ashutosh Bapat wrote:
2. to use madvise() the address needs to be backed by a file, so
memfd_create is a must.

It seems to work fine for anonymous mmapped memory here. See attached
test program.
On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres@anarazel.de> wrote:
2. to use madvise() the address needs to be backed by a file, so
memfd_create is a must.

I am quite sure that that is not true. I hacked this up with today's
postgres, and the madvise works with the mmap() backed allocation from
sysv_shmem.c, which is anonymous.

What made you conclude that that is the case?
You are right. I was misled by the following sentence in `man
madvise`: "but since Linux 3.5, any filesystem which supports the
fallocate(2) FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
Filesystems which do not support MADV_REMOVE fail with the error
EOPNOTSUPP." And in a subsequent experiment I dropped MAP_ANONYMOUS
from mmap() and used madvise(), which obviously didn't work. My bad.
In the attached patches, I have got rid of memfd_create. That simplifies code.
4. the address and length passed to madvise needs to be page aligned,
but that passed to fallocate() needn't be. [...]

I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?
No point really. But we cannot control the extensions, which may want
to specify a maximum size smaller than a page size. They wouldn't know
what page size the underlying machine will have, especially with huge
pages, which have a wide range of sizes. Even in the case of shared
buffers, a value of max_shared_buffers may cause buffer blocks to span
pages while other structures fit in a page.
In the attached patches, if a resizable structure's max_size is
smaller than a page size, it is treated as a fixed structure with
size = max_size. Any request to resize such a structure simply
updates the metadata without an actual madvise operation. Only
structures whose max_size > page_size are treated as truly resizable
and use madvise.

You bring up another interesting point. If a resizable structure has a
maximum size larger than the page size, but it is allocated such that
its initial part is on a partially allocated page and its last part is
on another partially allocated page, those pages can never be freed
because of the adjoining structures. Per the logic in the attached
patches, all the fixed (or pseudo-resizable) structures are packed
together. The resizable structures start on a page boundary and their
max_sizes are adjusted to be page aligned. That way we can release
pages when a structure shrinks by more than a page.
However, there is one drawback. [...] So, there's a possibility that
bugs in size calculations might go unnoticed.

If that's something you care about, you can mprotect(PROT_NONE) the
relevant regions.
I am fine with letting go of this protection while getting rid of
multiple segments, if we all agree to do so.

I could be wrong, but mprotect needs to be executed in every backend
where the memory is mapped, and then a new backend needs to inherit it
from the postmaster. That makes resizing complex, since it has to touch
every backend. So avoiding mprotect is better.
[1]: /messages/by-id/CAKZiRmwxVqEbp7JgOed=BCT6cq8RNuHk3N0vuwro65Tsw9E8NA@mail.gmail.com
PFA patches.
--
Best Wishes,
Ashutosh Bapat
Attachments:
0002-Get-rid-of-global-shared-memory-pointer-mac-20260217.patch (text/x-patch, +77 -52)
0001-wip-Introduce-a-new-way-of-registering-shar-20260217.patch (text/x-patch, +665 -389)
0003-WIP-resizable-shared-memory-structures-20260217.patch (text/x-patch, +874 -23)
On Tue, Feb 17, 2026 at 5:06 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres@anarazel.de> wrote:
I think we *do* want the MADV_POPULATE_WRITE, at least when using huge
pages, because otherwise you'll get a SIGBUS when accessing the memory
if there is no huge page available anymore.

[...] If we perform MADV_POPULATE_WRITE, do we want it only for
resizable shared memory structures or all the structures in the shared
memory?
In the attached patches, I have used MADV_POPULATE_WRITE during
resizing, which is a run-time operation. When the structures are
allocated at server start, they are usually initialised, which
allocates their memory anyway; so we don't need MADV_POPULATE_WRITE at
that time, and thus avoid adding to startup slowness, if any. Buffer
blocks are not initialised at the time of starting the server, so
their memory is allocated as they are accessed. But that's how it
works today, so no change there.
[...]
If the general approach in the attached patches looks good, we can
work on improving the 0001 + 0002 to be committable and then work on
0003.
--
Best Wishes,
Ashutosh Bapat
Attachments:
0001-wip-Introduce-a-new-way-of-registering-shar-20260218.patch (text/x-patch, +665 -389)
0002-Get-rid-of-global-shared-memory-pointer-mac-20260218.patch (text/x-patch, +77 -52)
0003-WIP-resizable-shared-memory-structures-20260218.patch (text/x-patch, +860 -23)
On Wed, Feb 18, 2026 at 9:17 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Tue, Feb 17, 2026 at 5:06 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres@anarazel.de> wrote:
I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
because otherwise you'll get a SIGBUS when accessing the memory if there is no
huge page available anymore.Ok.
Jakub's experiments [1] showed that fallocate()ing shared memory would
slow down postmaster start on a slow machine. I suppose the same thing
applies to MADV_POPULATE_WRITE. And we don't do that today even in the
case of huge pages; so we already have that problem.If we perform MADV_POPULATE_WRITE, do we want it only for resizable
shared memory structures or all the structures in the shared memory?In the attached patches, I have used MADV_POPULATE_WRITE during
resizing, which is run time operation. When the structures are
allocated when server starts, they are usually initialised, so we end
up allocating memory for the same. So we don't need
MADV_POPULATE_WRITE at that time, and thus avoid affecting startup
slowness, if any. Buffer blocks are not initialised at the time of
starting the server, so their memory is allocated as they are
accessed. But that's how it works today, so no change there.4. the address and length passed to madvise needs to be page aligned,
but that passed to fallocate() needn't be. `man fallocate` says
"Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
range starting at offset and continuing for len bytes. Within the
specified range, partial filesystem blocks are zeroed, and whole
filesystem blocks are removed from the file.". It seems to be
automatically taking care of the page size. So using fallocate()
simplifies logic. Further `man madvise` says "but since Linux 3.5, any
filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
guaranteed to be available on a system which supports MADV_REMOVE.

I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?

No point really. But we cannot control the extensions which want to
specify a maximum size smaller than a page size. They wouldn't know
what page size the underlying machine will have, especially with huge
pages which have a wide range of sizes. Even in the case of shared
buffers, a value of max_shared_buffers may cause buffer blocks to span
pages but other structures may fit in a page.

In the attached patches, if a resizable structure is such that its
max_size is smaller than a page size, it is treated as a fixed
structure with size = max_size. Any request to resize such structures
will simply update the metadata without an actual madvise operation.
Only the structures whose max_size > page_size are treated as truly
resizable and will use madvise. You bring up another interesting point.
If a resizable structure has a maximum size higher than the page size,
but it is allocated such that the initial part of it is on a partially
allocated page and the last part of it is on another partially
allocated page, those pages are never freed because of adjoining
structures. Per the logic in the attached patches, all the fixed (or
pseudo-resizable) structures are packed together. The resizable
structures start on a page boundary and their max_sizes are adjusted
to be page aligned. That way we can release pages when the structure
shrinks by more than a page.

Using fallocate() (or madvise()) to free memory, we don't need
multiple segments. So much less code churn compared to the multiple
mappings approach. However, there is one drawback. In the multiple
mapping approach access beyond the current size of the structure would
result in segfault or bus error. But in the fallocate/madvise approach
such an access does not cause a crash. A write beyond the pages that
fit the current size of the structure causes more memory to be
allocated silently. A read returns 0s. So, there's a possibility that
bugs in size calculations might go unnoticed. I think that's how it
works even today, access in the yet un-allocated part of the shared
memory will simply go unnoticed.

If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.

I am fine with letting go of this protection while getting rid of
multiple segments, if we all agree to do so.

I could be wrong, but mprotect needs to be executed in every backend
where the memory is mapped and then a new backend needs to inherit it
from the postmaster. Makes resizing complex since it has to touch
every backend. So avoiding mprotect is better.
Sent too soon.
I have also reworked the test into a TAP test, which looks more stable than
the earlier version. Haven't had any failures on my laptop.
If the general approach in the attached patches looks good, we can
work on improving the 0001 + 0002 to be committable and then work on
0003.
The resizable memory patch works only on Linux, where
MADV_POPULATE_WRITE and MADV_REMOVE are supported on anonymous shared
memory. On other platforms and where that support doesn't exist, we
will need to disable the feature for now. That work remains. Also the
TODOs need to be addressed.
--
Best Wishes,
Ashutosh Bapat
On Wed, Feb 18, 2026 at 9:17 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
4. The address and length passed to madvise() need to be page aligned,
but that passed to fallocate() needn't be. `man fallocate` says
"Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
range starting at offset and continuing for len bytes. Within the
specified range, partial filesystem blocks are zeroed, and whole
filesystem blocks are removed from the file.". It seems to be
automatically taking care of the page size. So using fallocate()
simplifies logic. Further `man madvise` says "but since Linux 3.5, any
filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
guaranteed to be available on a system which supports MADV_REMOVE.

I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?

No point really. But we cannot control the extensions which want to
specify a maximum size smaller than a page size. They wouldn't know
what page size the underlying machine will have, especially with huge
pages which have a wide range of sizes. Even in the case of shared
buffers, a value of max_shared_buffers may cause buffer blocks to span
pages but other structures may fit in a page.

In the attached patches, if a resizable structure is such that its
max_size is smaller than a page size, it is treated as a fixed
structure with size = max_size. Any request to resize such structures
will simply update the metadata without an actual madvise operation.
Only the structures whose max_size > page_size are treated as truly
resizable and will use madvise. You bring up another interesting point.
If a resizable structure has a maximum size higher than the page size,
but it is allocated such that the initial part of it is on a partially
allocated page and the last part of it is on another partially
allocated page, those pages are never freed because of adjoining
structures. Per the logic in the attached patches, all the fixed (or
pseudo-resizable) structures are packed together. The resizable
structures start on a page boundary and their max_sizes are adjusted
to be page aligned. That way we can release pages when the structure
shrinks by more than a page.
It was a mistake on my part to assume that more memory will be freed
if we page align the start and end of a resizable structure. I didn't
account for the memory wasted in the alignment itself. That amount comes
out to be the same as the amount of memory wasted if we don't page align
the structure. But the code is simpler if we don't page align the
structure as seen in the attached patches.
Using fallocate() (or madvise()) to free memory, we don't need
multiple segments. So much less code churn compared to the multiple
mappings approach. However, there is one drawback. In the multiple
mapping approach access beyond the current size of the structure would
result in segfault or bus error. But in the fallocate/madvise approach
such an access does not cause a crash. A write beyond the pages that
fit the current size of the structure causes more memory to be
allocated silently. A read returns 0s. So, there's a possibility that
bugs in size calculations might go unnoticed. I think that's how it
works even today, access in the yet un-allocated part of the shared
memory will simply go unnoticed.

If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.

I am fine with letting go of this protection while getting rid of
multiple segments, if we all agree to do so.

I could be wrong, but mprotect needs to be executed in every backend
where the memory is mapped and then a new backend needs to inherit it
from the postmaster. Makes resizing complex since it has to touch
every backend. So avoiding mprotect is better.
I discussed this point with Andres offlist. Here's a summary of that
discussion. Any serious users of resizable shared memory structures
would need to send proc signal barriers to synchronize the resizing
across the backends. This barrier can be used to perform mprotect() in
the backends and a separate signal to Postmaster, if mprotect is
needed in Postmaster. But whether mprotect is needed depends upon the
use case. It should be the responsibility of the resizable structure's
user, and not of ShmemResizeRegistered().
The following points need a bit of discussion.
1. calculation of allocated_size
For fixed-size shared memory structures, allocated_size is the size
of the structure after cache-aligning it. Assuming that the shared
memory is allocated in pages, this is also the actual memory
allocated to the structure once the whole structure has been written
to. For a resizable structure, it's a bit more complicated. We reserve
the maximum space required by the structure. At a given point in
time, the page where the next structure begins and the page which
contains the current end of the structure are allocated; the pages
in between are not. Thus the allocated_size should be the length from
the start of the structure to the end of the page containing the
current end of the structure, plus the part of the page where the next
structure starts, up to the start of the next structure. That is what
is implemented in the attached patches.
2. GUCs shared_memory_size, shared_memory_size_in_huge_pages
These GUCs indicate the size of the shared memory in bytes and in huge
pages respectively. Without resizable shared memory structures,
calculating these is straightforward: we sum the sizes of all the
requested structures. With resizable shared memory structures, these
GUCs do not make much sense. Since the memory allocated to the
resizable structures can be anywhere between zero and the maximum,
neither the sum of their initial sizes nor the sum of their maximum
sizes can be reported as shared_memory_size. Similarly for
shared_memory_size_in_huge_pages. We need two GUCs to replace each of
the existing GUCs - max_shared_memory_size, initial_shared_memory_size
and their huge-page peers. max_shared_memory_size is the sum of the
maximum sizes of the resizable structures plus the requested sizes of
the fixed structures. initial_shared_memory_size is the sum of the
initial sizes requested for all the structures.
3. Testing the memory allocation
I couldn't find a way to reliably know the shared memory allocated at
a given address in a process. RssShmem gives the amount of shared
memory accessed by the process, which includes memory allocated to
the fixed structures that the process has touched. This value isn't
stable across runs of the test in the patch. The test logs the
reported RssShmem alongside the variations in the resizable shared
memory structure, which can be visually inspected to be within limits.
But those limits are hard to check in the test code. Looking for some
suggestions here.
Disabling resizable structures on builds that do not support them is
still a TODO.
--
Best Wishes,
Ashutosh Bapat
Attachments:
0001-wip-Introduce-a-new-way-of-registering-shar-20260223.patch (text/x-patch, +665 -389)
0002-Get-rid-of-global-shared-memory-pointer-mac-20260223.patch (text/x-patch, +77 -52)
0003-WIP-resizable-shared-memory-structures-20260223.patch (text/x-patch, +789 -30)
I spent some time cleaning up the new registration machinery. I didn't
look at the "resizeable" part yet, but it should fit in nicely, as
Ashutosh demonstrated.
I added a lot of comments, changed the existing docs on how to allocate
shared memory in extensions to use the machinery, and tons of other
cleanups. I'm getting pretty happy with this, but there are a couple of
weak spots:
Firstly, I'm not sure what to do with ShmemRegisterHash() and the
'HASHCTL *infoP' argument to it. I feel it'd be nicer if the HASHCTL was
just part of the ShmemHashDesc struct, but I'm not sure if that fits all
the callers. I'll have to try that out I guess.
Secondly, I'm not 100% happy with the facilities we provide to
extensions. The lifecycle of _PG_init() and shmem_request/startup_hooks
is a little messy. The status quo is that a shared library gets control
in three different places:
1. _PG_init() gets called early at postmaster startup, if the library is
in shared_preload_libraries. If it's not in shared_preload_libraries, it
gets called whenever the module is loaded.
2. The library can install a shmem_request_hook, which gets called early
at postmaster startup, but after initializing the MaxBackends GUC. It
only gets called when the library is loaded via shared_preload_libraries.
3. The library can install a shmem_startup_hook. It gets called later at
postmaster startup, after the shared memory segment has been allocated.
In EXEC_BACKEND mode it also gets called at backend startup. It does not
get called if the library is not listed in shared_preload_libraries.
None of these is quite the right moment to call the new
ShmemRegisterStruct() function. _PG_init() is too early if the extension
needs MaxBackends for sizing the shared memory area. shmem_request_hook
is otherwise good, but in EXEC_BACKEND mode, the ShmemRegisterStruct()
function also needs to be called at backend startup, and
shmem_request_hook is not called there. shmem_startup_hook() is too late.
For now, I documented that an extension should call
ShmemRegisterStruct() from _PG_init(), but may adjust the size in the
shmem_request_hook, if needed.
Another wrinkle here is that you still need the shmem_request_hook, if
you want to call RequestNamedLWLockTranche(). It cannot be called from
_PG_init(). I'm not sure why that is.
So I think that requires a little more refactoring: an extension
shouldn't need to use shmem_request/startup_hook with the new APIs
anymore. We should provide more ergonomic callbacks or other mechanisms
to accomplish the same things.
- Heikki