Tweaking DSM and DSA limits
Hello,
If you run a lot of parallel queries that use big parallel hash joins
simultaneously, you can run out of DSM slots (for example, when
testing many concurrent parallel queries). That's because we allow 64
slots + 2 * MaxBackends, but allocating seriously large amounts of
dynamic shared memory requires lots of slots.
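To put rough numbers on that (illustrative arithmetic only; MaxBackends is really max_connections plus autovacuum workers and a few other processes, so this slightly undercounts the real budget):

```python
def dsm_slot_budget(max_backends: int) -> int:
    """Total DSM control slots under the current formula:
    64 fixed slots plus 2 per backend."""
    return 64 + 2 * max_backends

# With ~100 backends there are only 64 + 200 = 264 slots to share, so a
# handful of big parallel hash joins, each holding a few dozen DSA
# segments, can use them all up.
print(dsm_slot_budget(100))
```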
Originally the DSM system was designed to support one segment per
parallel query, but now we also use one for the session and any number
for parallel executor nodes that want space limited by work_mem.
The number of slots it takes for a given total amount of shared memory
depends on the macro DSA_NUM_SEGMENTS_AT_EACH_SIZE. Since DSM slots
are relatively scarce (we use inefficient algorithms to access them,
and we think that some operating systems won't like us if we create
too many, so we impose this scarcity on ourselves), each DSA area
allocates bigger and bigger segments as it goes, starting with 1MB.
The approximate number of segments required to allocate various sizes
incrementally using different values of DSA_NUM_SEGMENTS_AT_EACH_SIZE
can be seen in this table:
         N =   1   2   3   4
  1MB          1   1   1   1
 64MB          6  10  13  16
512MB          9  16  22  28
  1GB         10  18  25  32
  8GB         13  24  34  44
 16GB         14  26  37  48
 32GB         15  28  40  52
  1TB         20  38  55  72
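The table can be reproduced with a little arithmetic: segment sizes start at 1MB and double after every DSA_NUM_SEGMENTS_AT_EACH_SIZE segments, and each cell is the largest segment count whose cumulative size still fits within the target (hence "approximate"). A sketch of that rule, not the actual DSA code:

```python
def segments_for(total_mb: int, n: int) -> int:
    """Largest number of segments whose cumulative size fits in
    total_mb, when sizes start at 1MB and double after every n
    segments (n plays the role of DSA_NUM_SEGMENTS_AT_EACH_SIZE)."""
    size_mb, allocated, count, at_this_size = 1, 0, 0, 0
    while allocated + size_mb <= total_mb:
        allocated += size_mb
        count += 1
        at_this_size += 1
        if at_this_size == n:
            size_mb *= 2
            at_this_size = 0
    return max(count, 1)

for n in (1, 2, 3, 4):
    print(n, [segments_for(mb, n) for mb in (1, 64, 1024, 1024 * 1024)])
```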
It's currently set to 4, but I now think that was too cautious. It
tries to avoid fragmentation (that is, memory allocated and in some
cases committed by the operating system that we don't turn out to
need) by ramping up slowly, but it's pretty wasteful of slots.
Perhaps it should be set to 2?
Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).
Also, there are some outdated comments near
PG_DYNSHMEM_SLOTS_PER_BACKEND's definition that we might as well drop
along with the macro.
Draft patch attached.
--
Thomas Munro
https://enterprisedb.com
Attachments:
0001-Add-dynamic_shared_memory_segments_per_backend-GUC.patch (+35/-10)
On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
It's currently set to 4, but I now think that was too cautious. It
tries to avoid fragmentation by ramping up slowly (that is, memory
allocated and in some cases committed by the operating system that we
don't turn out to need), but it's pretty wasteful of slots. Perhaps
it should be set to 2?
+1. I think I said at the time that I thought that was too cautious...
Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).
I am not convinced that we really need to GUC-ify this. How about
just bumping the value up from 2 to say 5? Between the preceding
change and this one we ought to buy ourselves more than 4x, and if
that is not enough then we can ask whether raising max_connections is
a reasonable workaround, and if that's still not enough then we can
revisit this idea, or maybe come up with something better. The
problem I have with a GUC here is that nobody without a PhD in
PostgreSQL-ology will have any clue how to set it, and while that's
good for your employment prospects and mine, it's not so great for
PostgreSQL users generally.
As Andres observed off-list, it would also be a good idea to allow
things that are going to gobble memory like hash joins to have some
input into how much memory gets allocated. Maybe preallocating the
expected size of the hash is too aggressive -- estimates can be wrong,
and it could be much smaller. But maybe we should allocate at least,
say, 1/64th of that amount, and act as if
DSA_NUM_SEGMENTS_AT_EACH_SIZE == 1 until the cumulative memory
allocation is more than 25% of that amount. So if we think it's gonna
be 1GB, start by allocating 16MB and double the size of each
allocation thereafter until we get to at least 256MB allocated. So
then we'd have 16MB + 32MB + 64MB + 128MB + 256MB + 256MB + 512MB = 7
segments instead of the 32 required currently or the 18 required with
DSA_NUM_SEGMENTS_AT_EACH_SIZE == 2.
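If I've followed the arithmetic correctly, that proposal can be sketched like this (hypothetical parameters, matching the example: a 1/64th starting size, doubling every allocation until a quarter of the estimate is allocated, then every 2 segments as with DSA_NUM_SEGMENTS_AT_EACH_SIZE == 2):

```python
def segments_with_hint(estimate_mb: int, n: int = 2):
    """Segments needed to cover estimate_mb when the first segment is
    estimate_mb/64 and sizes double on every allocation until 25% of
    the estimate is allocated, then after every n segments."""
    size = max(estimate_mb // 64, 1)
    allocated = count = at_this_size = 0
    while allocated < estimate_mb:
        allocated += size
        count += 1
        at_this_size += 1
        if allocated < estimate_mb // 4 or at_this_size == n:
            size *= 2
            at_this_size = 0
    return count, allocated

# For a 1GB estimate this yields the sequence 16 + 32 + 64 + 128 +
# 256 + 256 + 512 = 1264MB in 7 segments, as in the example above.
print(segments_with_hint(1024))
```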
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2019-06-20 14:20:27 -0400, Robert Haas wrote:
On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).
I am not convinced that we really need to GUC-ify this. How about
just bumping the value up from 2 to say 5?
I'm not sure either. Although it's not great if the only way out for a
user hitting this is to increase max_connections... But we should really
increase the default.
As Andres observed off-list, it would also be a good idea to allow
things that are going to gobble memory like hash joins to have some
input into how much memory gets allocated. Maybe preallocating the
expected size of the hash is too aggressive -- estimates can be wrong,
and it could be much smaller.
At least for the case of the hashtable itself, we allocate that at the
predicted size immediately. So a mis-estimation wouldn't change
anything. For the entries, yea, something like you suggest would make
sense.
Greetings,
Andres Freund
On Thu, Jun 20, 2019 at 02:20:27PM -0400, Robert Haas wrote:
On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
It's currently set to 4, but I now think that was too cautious. It
tries to avoid fragmentation by ramping up slowly (that is, memory
allocated and in some cases committed by the operating system that we
don't turn out to need), but it's pretty wasteful of slots. Perhaps
it should be set to 2?
+1. I think I said at the time that I thought that was too cautious...
Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).
I am not convinced that we really need to GUC-ify this. How about
just bumping the value up from 2 to say 5? Between the preceding
change and this one we ought to buy ourselves more than 4x, and if
that is not enough then we can ask whether raising max_connections is
a reasonable workaround,
Is there perhaps a way to make raising max_connections not require a
restart? There are plenty of situations out there where restarts
aren't something that can be done on a whim.
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Jun 20, 2019 at 5:00 PM David Fetter <david@fetter.org> wrote:
Is there perhaps a way to make raising max_connections not require a
restart? There are plenty of situations out there where restarts
aren't something that can be done on a whim.
Sure, if you want to make this take about 100x more work.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 21, 2019 at 6:52 AM Andres Freund <andres@anarazel.de> wrote:
On 2019-06-20 14:20:27 -0400, Robert Haas wrote:
On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).I am not convinced that we really need to GUC-ify this. How about
just bumping the value up from 2 to say 5?
I'm not sure either. Although it's not great if the only way out for a
user hitting this is to increase max_connections... But we should really
increase the default.
Ok, hard-to-explain GUC abandoned. Here is a patch that just adjusts
the two constants. DSM's array allows for 5 slots per connection (up
from 2), and DSA doubles its size after every two segments (down from
4).
As Andres observed off-list, it would also be a good idea to allow
things that are going to gobble memory like hash joins to have some
input into how much memory gets allocated. Maybe preallocating the
expected size of the hash is too aggressive -- estimates can be wrong,
and it could be much smaller.
At least for the case of the hashtable itself, we allocate that at the
predicted size immediately. So a mis-estimation wouldn't change
anything. For the entries, yea, something like you suggest would make
sense.
At the moment the 32KB chunks are used as parallel granules for
various work (inserting, repartitioning, rebucketing). I could
certainly allocate a much bigger piece based on estimates, and then
invent another kind of chunks inside that, or keep the existing
layering but find a way to hint to DSA what allocation stream to
expect in the future so it can get bigger underlying chunks ready.
One problem is that it'd result in large, odd-sized memory segments,
whereas the current scheme uses power of two sizes and might be more
amenable to a later segment reuse scheme; or maybe that doesn't really
matter.
I have a long wish list of improvements I'd like to investigate in
this area, subject for future emails, but while I'm making small
tweaks, here's another small thing: there is no "wait event" while
allocating (in the kernel sense) POSIX shm on Linux, unlike the
equivalent IO when file-backed segments are filled with write() calls.
Let's just reuse the same wait event, so that you can see what's going
on in pg_stat_activity.
Attachments:
0001-Adjust-the-constants-used-to-reserve-DSM-segment-slo.patch (+2/-8)
0002-Report-time-spent-in-posix_fallocate-as-a-wait-event.patch (+2/-1)
On Mon, Oct 21, 2019 at 12:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, Jun 21, 2019 at 6:52 AM Andres Freund <andres@anarazel.de> wrote:
On 2019-06-20 14:20:27 -0400, Robert Haas wrote:
I am not convinced that we really need to GUC-ify this. How about
just bumping the value up from 2 to say 5?
I'm not sure either. Although it's not great if the only way out for a
user hitting this is to increase max_connections... But we should really
increase the default.
Ok, hard-to-explain GUC abandoned. Here is a patch that just adjusts
the two constants. DSM's array allows for 5 slots per connection (up
from 2), and DSA doubles its size after every two segments (down from
4).
Pushed. No back-patch for now: the risk/reward ratio doesn't seem
right for that.
I have a long wish list of improvements I'd like to investigate in
this area, subject for future emails, but while I'm making small
tweaks, here's another small thing: there is no "wait event" while
allocating (in the kernel sense) POSIX shm on Linux, unlike the
equivalent IO when file-backed segments are filled with write() calls.
Let's just reuse the same wait event, so that you can see what's going
on in pg_stat_activity.
Also pushed.