Changing shared_buffers without restart
TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.
Hi,
Being able to change PostgreSQL configuration on the fly is an important
property for performance tuning, since it reduces the feedback time and
the invasiveness of the process. In certain cases it even becomes highly
desirable, e.g. when doing automatic tuning. But there are a couple of
important configuration options that cannot be modified without a restart,
the most notorious example being shared_buffers.
I've been working recently on an idea for changing that, allowing
shared_buffers to be modified without a restart. To demonstrate the
approach, I've prepared a PoC that ignores lots of stuff, but works in the
limited set of use cases I was testing. I would like to discuss the idea
and get some feedback.
Patches 1-3 prepare the infrastructure and shared memory layout. They could
be useful even in a multithreaded PostgreSQL, when there will be no need for
shared memory. I assume that in the multithreaded world there will still be
a need for a contiguous chunk of memory to share between threads, and its
layout would be similar to the one with shared memory mappings.
Patch 4 actually does the resizing. It's shared memory specific of course,
and utilizes the Linux-specific mremap, meaning open portability questions.
Patch 5 is somewhat independent, but quite convenient to have. It also
utilizes the Linux-specific memfd_create call.
The patch set still doesn't address lots of things, e.g. shared memory
segment detach/reattach and portability questions, and it doesn't touch the
EXEC_BACKEND code or huge pages.
So far I've done some rudimentary testing: spinning up PostgreSQL, then
increasing shared_buffers and running pgbench with a scale factor large
enough to extend the data set into the newly allocated buffers:
-- shared_buffers 128 MB
=# SELECT * FROM pg_buffercache_summary();
buffers_used | buffers_unused | buffers_dirty | buffers_pinned
--------------+----------------+---------------+----------------
134 | 16250 | 1 | 0
-- change shared_buffers to 512 MB
=# select pg_reload_conf();
=# SELECT * FROM pg_buffercache_summary();
buffers_used | buffers_unused | buffers_dirty | buffers_pinned
--------------+----------------+---------------+---------------
221 | 65315 | 1 | 0
-- round of pgbench read-only load
=# SELECT * FROM pg_buffercache_summary();
buffers_used | buffers_unused | buffers_dirty | buffers_pinned
--------------+----------------+---------------+---------------
41757 | 23779 | 216 | 0
Here is the breakdown:
v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch
Preparation, introduces the possibility of working with many shmem
mappings. To make it less invasive, I've duplicated the shmem API to extend
it with the shmem_slot argument, while redirecting the original API to it.
There are probably better ways of doing that; I'm open to suggestions.
v1-0002-Allow-placing-shared-memory-mapping-with-an-offse.patch
Implements a new layout of shared memory mappings to include room for
resizing. I've done a couple of tests to verify that such space in between
doesn't affect how the kernel calculates actual used memory, to make sure
that e.g. cgroup will not trigger the OOM killer. The only change seems to
be in VmPeak, which counts total mapped pages.
v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch
Splits shared_buffers into multiple slots, moving out structures that depend on
NBuffers into separate mappings. There are two large gaps here:
* Shmem size calculation for those mappings is not correct yet; it includes
too many other things (no particular issues here, I just haven't had time).
* It makes hardcoded assumptions about the upper limit for resizing, which
is currently low, purely for experiments. Ideally there should be a new
configuration option to specify the total available memory, which would be
the base for subsequent calculations.
v1-0004-Allow-to-resize-shared-memory-without-restart.patch
Changes shared_buffers without a restart. The current approach is clumsy:
it adds an assign hook for shared_buffers and goes from there, using mremap
to resize the mappings. But I haven't immediately found any better
approach. Currently it supports only an increase of shared_buffers.
v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch
Allows an anonymous file to back a shared mapping. This makes certain
things easier, e.g. visual representation of the mappings, and gives an fd
for possible future customizations.
In this thread I'm hoping to answer the following questions:
* Are there any concerns about this approach?
* What would be a better mechanism to handle resizing than an assign hook?
* Assuming I'll be able to address already known missing bits, what are the
chances the patch series could be accepted?
Attachments:
v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch (+258 -125)
v1-0002-Allow-placing-shared-memory-mapping-with-an-offse.patch (+119 -2)
v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch (+97 -36)
v1-0004-Allow-to-resize-shared-memory-without-restart.patch (+171 -12)
v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch (+40 -10)
On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.
It was pointed out to me, that earlier this year there was a useful
discussion about similar matters "PGC_SIGHUP shared_buffers?" [1]. From
what I see the patch series falls into the "re-map" category in that
thread.
[1]: /messages/by-id/CA+TgmoaGCFPhMjz7veJOeef30=KdpOxgywcLwNbr-Gny-mXwcg@mail.gmail.com
Hi
I tried to apply the patches, but failed. I suppose the problem is with CRLF line endings in the patch files. At least, after manually changing v1-0001 and v1-0002 from CRLF to LF the patches applied, but that did not help for v1-0003 - v1-0005; they also produce other errors during the patch process. Could you check the patch files and post them in the correct format?
The new status of this patch is: Waiting on Author
On Sat, Oct 19, 2024 at 8:21 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Currently it
supports only an increase of shared_buffers.
Just BTW in case it is interesting, Palak and I experimented with how
to shrink the buffer pool while PostgreSQL is running, while we were
talking about 13453ee (which it shares infrastructure with). This
version fails if something is pinned and in the way of the shrink
operation, but you could imagine other policies (wait, cancel it,
...):
https://github.com/macdice/postgres/commit/db26fe0c98476cdbbd1bcf553f3b7864cb142247
On Thu, Nov 07, 2024 at 02:05:52PM GMT, Thomas Munro wrote:
On Sat, Oct 19, 2024 at 8:21 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Currently it
supports only an increase of shared_buffers.
Just BTW in case it is interesting, Palak and I experimented with how
to shrink the buffer pool while PostgreSQL is running, while we were
talking about 13453ee (which it shares infrastructure with). This
version fails if something is pinned and in the way of the shrink
operation, but you could imagine other policies (wait, cancel it,
...):
https://github.com/macdice/postgres/commit/db26fe0c98476cdbbd1bcf553f3b7864cb142247
Thanks, looks interesting. I'll try to experiment with that in the next
version of the patch.
On Wed, Nov 06, 2024 at 07:10:06PM GMT, Vladlen Popolitov wrote:
Hi
I tried to apply the patches, but failed. I suppose the problem is with CRLF line endings in the patch files. At least, after manually changing v1-0001 and v1-0002 from CRLF to LF the patches applied, but that did not help for v1-0003 - v1-0005; they also produce other errors during the patch process. Could you check the patch files and post them in the correct format?
The new status of this patch is: Waiting on Author
Well, I'm going to rebase the patch if that's what you mean. But just
FYI -- it could be applied without any issues to the base commit
mentioned in the series.
On 18.10.24 21:21, Dmitry Dolgov wrote:
v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch
Preparation, introduces the possibility to work with many shmem mappings. To
make it less invasive, I've duplicated the shmem API to extend it with the
shmem_slot argument, while redirecting the original API to it. There are
probably better ways of doing that, I'm open for suggestions.
After studying this a bit, I tend to think you should just change the
existing APIs in place. So for example,
void *ShmemAlloc(Size size);
becomes
void *ShmemAlloc(int shmem_slot, Size size);
There aren't that many callers, and all these duplicated interfaces
almost add more new code than they save.
It might be worth making exceptions for interfaces that are likely to be
used by extensions. For example, I see pg_stat_statements using
ShmemInitStruct() and ShmemInitHash(). But that seems to be it. Are
there any other examples out there? Maybe there are many more that I
don't see right now. But at least for the initialization functions, it
doesn't seem worth it to preserve the existing interfaces exactly.
In any case, I think the slot number should be the first argument. This
matches how MemoryContextAlloc() or talloc() work.
(Now here is an idea: Could these just be memory contexts? Instead of
making six shared memory slots, could you make six memory contexts with
a special shared memory type. And ShmemAlloc becomes the allocation
function, etc.?)
I noticed the existing code made inconsistent use of PGShmemHeader * vs.
void *, which also bled into your patch. I made the attached little
patch to clean that up a bit.
I suggest splitting the struct ShmemSegment into one struct for the
three memory addresses and a separate array just for the slock_t's. The
former struct can then stay private in storage/ipc/shmem.c, only the
locks need to be exported.
Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS.
Also, maybe some of this should be declared in storage/shmem.h rather
than in storage/pg_shmem.h. We have the existing ShmemLock in there, so
it would be a bit confusing to have the per-segment locks elsewhere.
v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch
Splits shared_buffers into multiple slots, moving out structures that depend on
NBuffers into separate mappings. There are two large gaps here:
* Shmem size calculation for those mappings is not correct yet, it includes too
many other things (no particular issues here, just haven't had time).
* It makes hardcoded assumptions about what is the upper limit for resizing,
which is currently low purely for experiments. Ideally there should be a new
configuration option to specify the total available memory, which would be a
base for subsequent calculations.
Yes, I imagine a shared_buffers_hard_limit setting. We could maybe
default that to the total available memory, but it would also be good to
be able to specify it directly, for testing.
v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch
Allows an anonyous file to back a shared mapping. This makes certain things
easier, e.g. mappings visual representation, and gives an fd for possible
future customizations.
I think this could be a useful patch just by itself, without the rest of
the series, because of
* By default, Linux will not add file-backed shared mappings into a
core dump, making it more convenient to work with them in PostgreSQL:
no more huge dumps to process.
This could be a significant operational benefit.
When you say "by default", is this adjustable? Does someone actually
want the whole shared memory in their core file? (If it's adjustable,
is it also adjustable for anonymous mappings?)
I'm wondering about this change:
-#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS (MAP_SHARED|MAP_HASSEMAPHORE)
It looks like this would affect all mmap() calls, not only the one
you're changing. But that's the only one that uses this macro! I don't
understand why we need this; I don't see anything in the commit log
about this ever being used for any portability. I think we should just
get rid of it and have mmap() use the right flags directly.
I see that FreeBSD has a memfd_create() function. Might be worth a try.
Obviously, this whole thing needs a configure test for memfd_create()
anyway.
I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear
how that interacts with the MAP_HUGETLB flag for mmap(). Do you need to
specify both of them if you want huge pages?
Attachments:
0001-More-thorough-use-of-PGShmemHeader-instead-of-void.patch.nocfbot (+12 -17)
On Tue, Nov 19, 2024 at 01:57:00PM GMT, Peter Eisentraut wrote:
On 18.10.24 21:21, Dmitry Dolgov wrote:
v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch
Preparation, introduces the possibility to work with many shmem mappings. To
make it less invasive, I've duplicated the shmem API to extend it with the
shmem_slot argument, while redirecting the original API to it. There are
probably better ways of doing that, I'm open for suggestions.
After studying this a bit, I tend to think you should just change the
existing APIs in place. So for example,
void *ShmemAlloc(Size size);
becomes
void *ShmemAlloc(int shmem_slot, Size size);
There aren't that many callers, and all these duplicated interfaces almost
add more new code than they save.
It might be worth making exceptions for interfaces that are likely to be
used by extensions. For example, I see pg_stat_statements using
ShmemInitStruct() and ShmemInitHash(). But that seems to be it. Are there
any other examples out there? Maybe there are many more that I don't see
right now. But at least for the initialization functions, it doesn't seem
worth it to preserve the existing interfaces exactly.
In any case, I think the slot number should be the first argument. This
matches how MemoryContextAlloc() or talloc() work.
Yeah, agree. I'll reshape this part, thanks.
(Now here is an idea: Could these just be memory contexts? Instead of
making six shared memory slots, could you make six memory contexts with a
special shared memory type. And ShmemAlloc becomes the allocation function,
etc.?)
Sounds interesting. I don't know how well the memory context interface
would fit here, but I'll do some investigation.
I noticed the existing code made inconsistent use of PGShmemHeader * vs.
void *, which also bled into your patch. I made the attached little patch
to clean that up a bit.
Right, it was bothering me the whole time, but not strongly enough to make
me fix it in the PoC just yet.
I suggest splitting the struct ShmemSegment into one struct for the three
memory addresses and a separate array just for the slock_t's. The former
struct can then stay private in storage/ipc/shmem.c, only the locks need to
be exported.
Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS.
Also, maybe some of this should be declared in storage/shmem.h rather than
in storage/pg_shmem.h. We have the existing ShmemLock in there, so it would
be a bit confusing to have the per-segment locks elsewhere.
[...]
I'm wondering about this change:
-#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS (MAP_SHARED|MAP_HASSEMAPHORE)
It looks like this would affect all mmap() calls, not only the one you're
changing. But that's the only one that uses this macro! I don't understand
why we need this; I don't see anything in the commit log about this ever
being used for any portability. I think we should just get rid of it and
have mmap() use the right flags directly.
I see that FreeBSD has a memfd_create() function. Might be worth a try.
Obviously, this whole thing needs a configure test for memfd_create()
anyway.
Yep, those points make sense to me.
v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch
Allows an anonyous file to back a shared mapping. This makes certain things
easier, e.g. mappings visual representation, and gives an fd for possible
future customizations.
I think this could be a useful patch just by itself, without the rest of the
series, because of
* By default, Linux will not add file-backed shared mappings into a
core dump, making it more convenient to work with them in PostgreSQL:
no more huge dumps to process.
This could be a significant operational benefit.
When you say "by default", is this adjustable? Does someone actually want
the whole shared memory in their core file? (If it's adjustable, is it also
adjustable for anonymous mappings?)
Yes, there is /proc/<pid>/coredump_filter [1], which allows specifying
what to include. One can ask to exclude anon, file-backed and hugetlb
shared memory, with the only caveat that it's per process. I guess
normally no one wants to have the full shared memory in the core dump, but
there could be exceptions.
[1]: https://www.kernel.org/doc/html/latest/filesystems/proc.html#proc-pid-coredump-filter-core-dump-filtering-settings
I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how
that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify
both of them if you want huge pages?
Correct, both (one flag in memfd_create and one for mmap) are needed to
use huge pages.
On 19.11.24 14:29, Dmitry Dolgov wrote:
I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how
that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify
both of them if you want huge pages?
Correct, both (one flag in memfd_create and one for mmap) are needed to
use huge pages.
I was worried because the FreeBSD man page says
MFD_HUGETLB This flag is currently unsupported.
It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.
But you should make sure in your patch that the right set of flags for
huge pages is passed.
On Fri, Oct 18, 2024 at 3:21 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.
A lot of people would like to have this feature, so I hope this
proposal works out. Thanks for working on it.
I think the idea of having multiple shared memory segments is
interesting and makes sense, but I would prefer to see them called
"segments" rather than "slots" just as do we do for DSMs. The name
"slot" is somewhat overused, and invites confusion with replication
slots, inter alia. I think it's possible that having multiple fixed
shared memory segments will spell trouble on Windows, where we already
need to use a retry loop to try to get the main shared memory segment
mapped at the correct address. If there are multiple segments and we
need whatever ASLR stuff happens on Windows to not place anything else
overlapping with any of them, that means there's more chances for
stuff to fail than if we just need one address range to be free.
Granted, the individual ranges are smaller, so maybe it's fine? But I
don't know.
The big thing that worries me is synchronization, and while I've only
looked at the patch set briefly, it doesn't look to me as though
there's enough machinery here to make that work correctly. Suppose
that shared_buffers=8GB (a million buffers) and I change it to
shared_buffers=16GB (2 million buffers). As soon as any one backend
has seen that changed and expanded shared_buffers, there's a
possibility that some other backend which has not yet seen the change
might see a buffer number greater than a million. If it tries to use
that buffer number before it absorbs the change, something bad will
happen. The most obvious way for it to see such a buffer number - and
possibly the only one - is to do a lookup in the buffer mapping table
and find a buffer ID there that was inserted by some other backend
that has already seen the change.
Fixing this seems tricky. My understanding is that BufferGetBlock() is
extremely performance-critical, so having to do a bounds check there
to make sure that a given buffer number is in range would probably be
bad for performance. Also, even if the overhead weren't prohibitive, I
don't think we can safely stick code that unmaps and remaps shared
memory segments into a function that currently just does math, because
we've probably got places where we assume this operation can't fail --
as well as places where we assume that if we call BufferGetBlock(i)
and then BufferGetBlock(j), the second call won't change the answer to
the first.
It seems to me that it's probably only safe to swap out a backend's
notion of where shared_buffers is located when the backend holds on
buffer pins, and maybe not even all such places, because it would be a
problem if a backend looks up the address of a buffer before actually
pinning it, on the assumption that the answer can't change. I don't
know if that ever happens, but it would be a legal coding pattern
today. Doing it between statements seems safe as long as there are no
cursors holding pins. Doing it in the middle of a statement is
probably possible if we can verify that we're at a "safe" point in the
code, but I'm not sure exactly which points are safe. If we have no
code anywhere that assumes the address of an unpinned buffer can't
change before we pin it, then I guess the check for pins is the only
thing we need, but I don't know that to be the case.
I guess I would have imagined that a change like this would have to be
done in phases. In phase 1, we'd tell all of the backends that
shared_buffers had expanded to some new, larger value; but the new
buffers wouldn't be usable for anything yet. Then, once we confirmed
that everyone had the memo, we'd tell all the backends that those
buffers are now available for use. If shared_buffers were contracted,
phase 1 would tell all of the backends that shared_buffers had
contracted to some new, smaller value. Once a particular backend
learns about that, they will refuse to put any new pages into those
high-numbered buffers, but the existing contents would still be valid.
Once everyone has been told about this, we can go through and evict
all of those buffers, and then let everyone know that's done. Then
they shrink their mappings.
It looks to me like the patch doesn't expand the buffer mapping table,
which seems essential. But maybe I missed that.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 19.11.24 14:29, Dmitry Dolgov wrote:
I noticed the existing code made inconsistent use of PGShmemHeader * vs.
void *, which also bled into your patch. I made the attached little patch
to clean that up a bit.
Right, it was bothering me the whole time, but not strongly enough to make
me fix this in the PoC just yet.
I committed a bit of this, so check that when you're rebasing your patch
set.
On Mon, Nov 25, 2024 at 02:33:48PM GMT, Robert Haas wrote:
I think the idea of having multiple shared memory segments is
interesting and makes sense, but I would prefer to see them called
"segments" rather than "slots" just as do we do for DSMs. The name
"slot" is somewhat overused, and invites confusion with replication
slots, inter alia. I think it's possible that having multiple fixed
shared memory segments will spell trouble on Windows, where we already
need to use a retry loop to try to get the main shared memory segment
mapped at the correct address. If there are multiple segments and we
need whatever ASLR stuff happens on Windows to not place anything else
overlapping with any of them, that means there's more chances for
stuff to fail than if we just need one address range to be free.
Granted, the individual ranges are smaller, so maybe it's fine? But I
don't know.
I haven't had a chance to experiment with that on Windows, but I'm
hoping that in the worst case falling back to a single mapping via the
proposed infrastructure (with the consequent limitations) would be
acceptable.
The big thing that worries me is synchronization, and while I've only
looked at the patch set briefly, it doesn't look to me as though
there's enough machinery here to make that work correctly. Suppose
that shared_buffers=8GB (a million buffers) and I change it to
shared_buffers=16GB (2 million buffers). As soon as any one backend
has seen that changed and expanded shared_buffers, there's a
possibility that some other backend which has not yet seen the change
might see a buffer number greater than a million. If it tries to use
that buffer number before it absorbs the change, something bad will
happen. The most obvious way for it to see such a buffer number - and
possibly the only one - is to do a lookup in the buffer mapping table
and find a buffer ID there that was inserted by some other backend
that has already seen the change.
Right, I haven't put much effort into synchronization yet. It's on my
list for the next iteration of the patch.
code, but I'm not sure exactly which points are safe. If we have no
code anywhere that assumes the address of an unpinned buffer can't
change before we pin it, then I guess the check for pins is the only
thing we need, but I don't know that to be the case.
Probably I'm missing something here. What scenario do you have in mind
where the address of a buffer changes?
I guess I would have imagined that a change like this would have to be
done in phases. In phase 1, we'd tell all of the backends that
shared_buffers had expanded to some new, larger value; but the new
buffers wouldn't be usable for anything yet. Then, once we confirmed
that everyone had the memo, we'd tell all the backends that those
buffers are now available for use. If shared_buffers were contracted,
phase 1 would tell all of the backends that shared_buffers had
contracted to some new, smaller value. Once a particular backend
learns about that, they will refuse to put any new pages into those
high-numbered buffers, but the existing contents would still be valid.
Once everyone has been told about this, we can go through and evict
all of those buffers, and then let everyone know that's done. Then
they shrink their mappings.
Yep, sounds good. I was pondering a more crude approach, but doing
this in phases seems to be the way to go.
It looks to me like the patch doesn't expand the buffer mapping table,
which seems essential. But maybe I missed that.
Do you mean the "Shared Buffer Lookup Table"? It does expand it, but
under the somewhat unfitting name STRATEGY_SHMEM_SLOT. But now that I look
at the code, I see a few issues around that -- so I would have to
improve it anyway, thanks for pointing that out.
On Tue, Nov 26, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I haven't had a chance to experiment with that on Windows, but I'm
hoping that in the worst case fallback to a single mapping via proposed
infrastructure (and the consequent limitations) would be acceptable.
Yeah, if you can still fall back to a single mapping, I think that's
OK. It would be nicer if it could work on every platform in the same
way, but half a loaf is better than none.
code, but I'm not sure exactly which points are safe. If we have no
code anywhere that assumes the address of an unpinned buffer can't
change before we pin it, then I guess the check for pins is the only
thing we need, but I don't know that to be the case.
Probably I'm missing something here. What scenario do you have in mind,
when the address of a buffer is changing?
I was assuming that if you expand the mapping for shared_buffers, you
can't count on the new mapping being at the same address as the old
mapping. If you can, that makes things simpler, but what if the OS has
mapped something else just afterward, in the address space that you're
hoping to use when you expand the mapping?
It looks to me like the patch doesn't expand the buffer mapping table,
which seems essential. But maybe I missed that.
Do you mean the "Shared Buffer Lookup Table"? It does expand it, but
under somewhat unfitting name STRATEGY_SHMEM_SLOT. But now that I look
at the code, I see a few issues around that -- so I would have to
improve it anyway, thanks for pointing that out.
Yeah, we -- or at least I -- usually call that the buffer mapping
table. There are identifiers like BufMappingPartitionLock, for
example. I'm slightly surprised that the ShmemInitHash() call uses
something else as the identifier, but I guess that's how it is.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Nov 27, 2024 at 10:20:27AM GMT, Robert Haas wrote:
code, but I'm not sure exactly which points are safe. If we have no
code anywhere that assumes the address of an unpinned buffer can't
change before we pin it, then I guess the check for pins is the only
thing we need, but I don't know that to be the case.
Probably I'm missing something here. What scenario do you have in mind,
when the address of a buffer is changing?
I was assuming that if you expand the mapping for shared_buffers, you
can't count on the new mapping being at the same address as the old
mapping. If you can, that makes things simpler, but what if the OS has
mapped something else just afterward, in the address space that you're
hoping to use when you expand the mapping?
Yes, that's the whole point of the exercise with remap -- to keep
addresses unchanged, making buffer management simpler and allowing
mappings to be resized more quickly. The trade-off is that we would need
to take care of shared mapping placement.
My understanding is that clashing of mappings (either at creation time
or when resizing) could happen only within the process address space,
and the assumption is that by the time we prepare the mapping layout, all
the rest of the mappings for the process are already done. But I agree, it's
an interesting question -- I'm going to investigate if those assumptions
could be wrong under certain conditions. Currently if something else is
mapped at the same address where we want to expand the mapping, we will
get an error and can decide how to proceed (e.g. if it happens at
creation time, proceed with a single mapping, otherwise ignore mapping
resize).
On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
My understanding is that clashing of mappings (either at creation time
or when resizing) could happen only within the process address space,
and the assumption is that by the time we prepare the mapping layout all
the rest of mappings for the process are already done.
I don't think that's correct at all. First, the user could type LOAD
'whatever' at any time. But second, even if they don't or you prohibit
them from doing so, the process could allocate memory for any of a
million different things, and that could require mapping a new region
of memory, and the OS could choose to place that just after an
existing mapping, or at least close enough that we can't expand the
object size as much as desired.
If we had an upper bound on the size of shared_buffers and could
reserve that amount of address space at startup time but only actually
map a portion of it, then we could later remap and expand into the
reserved space. Without that, I think there's absolutely no guarantee
that the amount of address space that we need is available when we
want to extend a mapping.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 27 Nov 2024 at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:
If we had an upper bound on the size of shared_buffers
I think a fairly reliable upper bound is the amount of physical memory
on the system at time of postmaster start. We could make it a GUC to
set the upper bound for the rare cases where people do stuff like
adding swap space later or doing online VM growth. We could even have
the default be something like 4x the physical memory to accommodate
those people by default.
reserve that amount of address space at startup time but only actually
map a portion of it
Or is this the difficult part?
Hi,
On 2024-11-27 16:05:47 -0500, Robert Haas wrote:
On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
My understanding is that clashing of mappings (either at creation time
or when resizing) could happen only within the process address space,
and the assumption is that by the time we prepare the mapping layout all
the rest of mappings for the process are already done.
I don't think that's correct at all. First, the user could type LOAD
'whatever' at any time. But second, even if they don't or you prohibit
them from doing so, the process could allocate memory for any of a
million different things, and that could require mapping a new region
of memory, and the OS could choose to place that just after an
existing mapping, or at least close enough that we can't expand the
object size as much as desired.
If we had an upper bound on the size of shared_buffers and could
reserve that amount of address space at startup time but only actually
map a portion of it, then we could later remap and expand into the
reserved space. Without that, I think there's absolutely no guarantee
that the amount of address space that we need is available when we
want to extend a mapping.
Strictly speaking we don't actually need to map shared buffers to the same
location in each process... We do need that for most other uses of shared
memory, including the buffer mapping table, but not for the buffer data
itself.
Whether it's worth the complexity of dealing with differing locations is
another matter.
Greetings,
Andres Freund
On Wed, Nov 27, 2024 at 4:28 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
On Wed, 27 Nov 2024 at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:
If we had an upper bound on the size of shared_buffers
I think a fairly reliable upper bound is the amount of physical memory
on the system at time of postmaster start. We could make it a GUC to
set the upper bound for the rare cases where people do stuff like
adding swap space later or doing online VM growth. We could even have
the default be something like 4x the physical memory to accommodate
those people by default.
Yes, Peter mentioned similar ideas on this thread last week.
reserve that amount of address space at startup time but only actually
map a portion of it
Or is this the difficult part?
I'm not sure how difficult this is, although I'm pretty sure that it's
more difficult than adding a GUC. My point wasn't so much whether this
is easy or hard but rather that it's essential if you want to avoid
having addresses change when the resizing happens.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Nov 27, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:
Strictly speaking we don't actually need to map shared buffers to the same
location in each process... We do need that for most other uses of shared
memory, including the buffer mapping table, but not for the buffer data
itself.
Well, if it can move, then you have to make sure it doesn't move while
someone's holding onto a pointer into it. I'm not exactly sure how
hard it is to guarantee that, but we certainly do construct pointers
into shared_buffers and use them at least for short periods of time,
so it's not a purely academic concern.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Nov 27, 2024 at 04:05:47PM GMT, Robert Haas wrote:
On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
My understanding is that clashing of mappings (either at creation time
or when resizing) could happen only within the process address space,
and the assumption is that by the time we prepare the mapping layout all
the rest of mappings for the process are already done.
I don't think that's correct at all. First, the user could type LOAD
'whatever' at any time. But second, even if they don't or you prohibit
them from doing so, the process could allocate memory for any of a
million different things, and that could require mapping a new region
of memory, and the OS could choose to place that just after an
existing mapping, or at least close enough that we can't expand the
object size as much as desired.
If we had an upper bound on the size of shared_buffers and could
reserve that amount of address space at startup time but only actually
map a portion of it, then we could later remap and expand into the
reserved space. Without that, I think there's absolutely no guarantee
that the amount of address space that we need is available when we
want to extend a mapping.
I've just done a couple of experiments, and I think this could be addressed
by careful placement of mappings as well, based on two assumptions: for a new
mapping the kernel always picks the lowest address that leaves enough space,
and an upper bound on the memory allocatable by other mappings can be derived
from the total available memory. With that in mind, the shared mapping layout
has to leave a large gap at the start, between the lowest address and the
shared mappings used for buffers and the rest -- the gap where all the other
mappings (allocations, libraries, madvise, etc.) will land. It's similar to
the address space reservation you mentioned above, significantly reduces the
possibility of clashing, and looks something like this:
01339000-0139e000 [heap]
0139e000-014aa000 [heap]
7f2dd72f6000-7f2dfbc9c000 /memfd:strategy (deleted)
7f2e0209c000-7f2e269b0000 /memfd:checkpoint (deleted)
7f2e2cdb0000-7f2e516b4000 /memfd:iocv (deleted)
7f2e57ab4000-7f2e7c478000 /memfd:descriptors (deleted)
7f2ebc478000-7f2ee8d3c000 /memfd:buffers (deleted)
^ note the distance between the two mappings,
which is intended for resizing
7f3168d3c000-7f318d600000 /memfd:main (deleted)
^ here is where the gap starts
7f4194c00000-7f4194e7d000
^ this one is an anonymous mapping created due to a large
memory allocation after the shared mappings were created
7f4195000000-7f419527d000
7f41952dc000-7f4195416000
7f4195416000-7f4195600000 /dev/shm/PostgreSQL.2529797530
7f4195600000-7f41a311d000 /usr/lib/locale/locale-archive
7f41a317f000-7f41a3200000
7f41a3200000-7f41a3201000 /usr/lib64/libicudata.so.74.2
The assumption about picking the lowest address is simply how it works right
now on Linux, and this fact is already relied on in the patch. The idea that
we can put an upper bound on the size of other mappings based on total
available memory comes from the fact that anonymous mappings much larger than
memory will fail without overcommit. With overcommit it becomes different,
but if allocations are hitting that limit, I can imagine there are bigger
problems than shared buffer resizing.
This approach follows the same ideas already used in the patch, and has the
same trade-offs: no address changes, but open questions about portability.