Tweaking DSM and DSA limits

Started by Thomas Munro over 6 years ago · 7 messages
#1 Thomas Munro
thomas.munro@gmail.com
1 attachment(s)

Hello,

If you run a lot of parallel queries that use big parallel hash joins
simultaneously, you can run out of DSM slots (for example, when
testing many concurrent parallel queries). That's because we allow 64
slots + 2 * MaxBackends, but allocating seriously large amounts of
dynamic shared memory requires lots of slots.

Originally the DSM system was designed to support one segment per
parallel query, but now we also use one for the session and any number
for parallel executor nodes that want space limited by work_mem.

The number of slots it takes for a given total amount of shared memory
depends on the macro DSA_NUM_SEGMENTS_AT_EACH_SIZE. Since DSM slots
are relatively scarce (we use inefficient algorithms to access them,
and we think that some operating systems won't like us if we create
too many, so we impose this scarcity on ourselves), each DSA area
allocates bigger and bigger segments as it goes, starting with 1MB.
The approximate number of segments required to allocate various sizes
incrementally using different values of DSA_NUM_SEGMENTS_AT_EACH_SIZE
can be seen in this table:

   N =    1    2    3    4

   1MB    1    1    1    1
  64MB    6   10   13   16
 512MB    9   16   22   28
   1GB   10   18   25   32
   8GB   13   24   34   44
  16GB   14   26   37   48
  32GB   15   28   40   52
   1TB   20   38   55   72
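
To make the geometry concrete, here's a quick standalone sketch that
reproduces the numbers above.  It's a simplified model of dsa.c's
ramp-up rather than the actual make_new_segment() logic: assume the
i-th segment created for an area (counting from 1) has size
1MB << (i / N), and count segments until the running total reaches
each target size.

#include <stdio.h>

/* Count segments needed to reach target_mb under this simplified model. */
static int
segments_needed(long target_mb, int segs_at_each_size)
{
    long    total_mb = 0;
    int     i = 0;

    while (total_mb < target_mb)
    {
        i++;
        total_mb += 1L << (i / segs_at_each_size);
    }
    return i;
}

int
main(void)
{
    const long  targets_mb[] = {1, 64, 512, 1024, 8192, 16384, 32768, 1048576};
    const char *labels[] = {"1MB", "64MB", "512MB", "1GB", "8GB", "16GB", "32GB", "1TB"};
    int         t;
    int         n;

    printf("%8s %5s %5s %5s %5s\n", "", "N=1", "N=2", "N=3", "N=4");
    for (t = 0; t < 8; t++)
    {
        printf("%8s", labels[t]);
        for (n = 1; n <= 4; n++)
            printf(" %5d", segments_needed(targets_mb[t], n));
        printf("\n");
    }
    return 0;
}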

It's currently set to 4, but I now think that was too cautious. It
tries to avoid fragmentation (that is, memory allocated, and in some
cases committed by the operating system, that we don't turn out to
need) by ramping up slowly, but it's pretty wasteful of slots. Perhaps
it should be set to 2?

Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).

Also, there are some outdated comments near
PG_DYNSHMEM_SLOTS_PER_BACKEND's definition that we might as well drop
along with the macro.

Draft patch attached.

--
Thomas Munro
https://enterprisedb.com

Attachments:

0001-Add-dynamic_shared_memory_segments_per_backend-GUC.patch
From ba9da88207d02f0173ec0cf4f698cf49b6926229 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 19 Jun 2019 12:43:09 +1200
Subject: [PATCH] Add dynamic_shared_memory_segments_per_backend GUC.

In some scenarios (probably mostly testing for now) it is possible
to run out of DSM segment slots, because we have a hardcoded
allowance of 2 per backend.  Make that into a GUC so that users
who run into problems have some room to move.

Also make dsa.c ramp up its segment sizes more aggressively, so that
fewer slots are needed.

Remove an outdated comment about expected segment and slot sizes.

Author: Thomas Munro (based on a suggestion from Andres Freund)
---
 doc/src/sgml/config.sgml                      | 17 +++++++++++++++++
 src/backend/storage/ipc/dsm.c                 |  8 +-------
 src/backend/storage/ipc/dsm_impl.c            |  1 +
 src/backend/utils/misc/guc.c                  | 12 ++++++++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/backend/utils/mmgr/dsa.c                  |  2 +-
 src/include/storage/dsm_impl.h                |  3 ++-
 7 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..3e8fab263f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1786,6 +1786,23 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-dynamic-shared-memory-segments-per-backend" xreflabel="dynamic_shared_memory_segments_per_backend">
+      <term><varname>dynamic_shared_memory_segments_per_backend</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>dynamic_shared_memory_segments_per_backend</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Limits the total number of dynamic shared memory segments the
+        server can create.  The system-wide limit is 64 + 
+        (<varname>max_connections</varname> *
+        <varname>dynamic_shared_memory_segments_per_backend</varname>) before
+        an error is raised.  The default value is <literal>2</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
 
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 142293fd2c..42948791e6 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -45,13 +45,7 @@
 
 #define PG_DYNSHMEM_CONTROL_MAGIC		0x9a503d32
 
-/*
- * There's no point in getting too cheap here, because the minimum allocation
- * is one OS page, which is probably at least 4KB and could easily be as high
- * as 64KB.  Each currently sizeof(dsm_control_item), currently 8 bytes.
- */
 #define PG_DYNSHMEM_FIXED_SLOTS			64
-#define PG_DYNSHMEM_SLOTS_PER_BACKEND	2
 
 #define INVALID_CONTROL_SLOT		((uint32) -1)
 
@@ -161,7 +155,7 @@ dsm_postmaster_startup(PGShmemHeader *shim)
 
 	/* Determine size for new control segment. */
 	maxitems = PG_DYNSHMEM_FIXED_SLOTS
-		+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
+		+ dynamic_shared_memory_segments_per_backend * MaxBackends;
 	elog(DEBUG2, "dynamic shared memory system will support %u segments",
 		 maxitems);
 	segsize = dsm_control_bytes_needed(maxitems);
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index d32996b6fc..a15409538c 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -112,6 +112,7 @@ const struct config_enum_entry dynamic_shared_memory_options[] = {
 
 /* Implementation selector. */
 int			dynamic_shared_memory_type;
+int			dynamic_shared_memory_segments_per_backend;
 
 /* Size of buffer to be used for zero-filling. */
 #define ZBUFFER_SIZE				8192
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..293f8d11a5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3197,6 +3197,18 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, assign_tcp_user_timeout, show_tcp_user_timeout
 	},
 
+	{
+		{"dynamic_shared_memory_segments_per_backend", PGC_POSTMASTER,
+			RESOURCES_MEM,
+			gettext_noop("Sets the number of DSM segments that can be created."),
+			NULL
+		},
+		&dynamic_shared_memory_segments_per_backend,
+		2, 1, 256,
+		NULL, NULL, NULL
+	},
+
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..d069712caf 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -144,6 +144,7 @@
 					#   windows
 					#   mmap
 					# (change requires restart)
+#dynamic_shared_memory_segments_per_backend = 2
 
 # - Disk -
 
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 4b826cdaa5..d8d36bf46c 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -74,7 +74,7 @@
  * dsm.c's limits on total number of segments), or limiting the total size
  * an area can manage when using small pointers.
  */
-#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4
+#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 2
 
 /*
  * The number of bits used to represent the offset part of a dsa_pointer.
diff --git a/src/include/storage/dsm_impl.h b/src/include/storage/dsm_impl.h
index 1dc557b791..fb116ce208 100644
--- a/src/include/storage/dsm_impl.h
+++ b/src/include/storage/dsm_impl.h
@@ -38,8 +38,9 @@
 #define USE_DSM_MMAP
 #endif
 
-/* GUC. */
+/* GUCs. */
 extern int	dynamic_shared_memory_type;
+extern int	dynamic_shared_memory_segments_per_backend;
 
 /*
  * Directory for on-disk state.
-- 
2.21.0

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#1)
Re: Tweaking DSM and DSA limits

On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:

It's currently set to 4, but I now think that was too cautious. It
tries to avoid fragmentation (that is, memory allocated, and in some
cases committed by the operating system, that we don't turn out to
need) by ramping up slowly, but it's pretty wasteful of slots. Perhaps
it should be set to 2?

+1. I think I said at the time that I thought that was too cautious...

Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).

I am not convinced that we really need to GUC-ify this. How about
just bumping the value up from 2 to say 5? Between the preceding
change and this one we ought to buy ourselves more than 4x, and if
that is not enough then we can ask whether raising max_connections is
a reasonable workaround, and if that's still not enough then we can
revisit this idea, or maybe come up with something better. The
problem I have with a GUC here is that nobody without a PhD in
PostgreSQL-ology will have any clue how to set it, and while that's
good for your employment prospects and mine, it's not so great for
PostgreSQL users generally.

As Andres observed off-list, it would also be a good idea to allow
things that are going to gobble memory like hash joins to have some
input into how much memory gets allocated. Maybe preallocating the
expected size of the hash is too aggressive -- estimates can be wrong,
and it could be much smaller. But maybe we should allocate at least,
say, 1/64th of that amount, and act as if
DSA_NUM_SEGMENTS_AT_EACH_SIZE == 1 until the cumulative memory
allocation is more than 25% of that amount. So if we think it's gonna
be 1GB, start by allocating 16MB and double the size of each
allocation thereafter until we get to at least 256MB allocated. So
then we'd have 16MB + 32MB + 64MB + 128MB + 256MB + 256MB + 512MB = 7
segments instead of the 32 required currently or the 18 required with
DSA_NUM_SEGMENTS_AT_EACH_SIZE == 2.
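
To illustrate that schedule (just a sketch of the idea from the
paragraph above -- the 1/64th starting size and 25% threshold are the
numbers quoted there, not code from any patch):

#include <stdio.h>

int
main(void)
{
    const int   estimate_mb = 1024;             /* planner's size estimate */
    int         segment_mb = estimate_mb / 64;  /* start at 1/64th: 16MB */
    int         total_mb = 0;
    int         nsegments = 0;
    int         at_this_size = 0;

    while (total_mb < estimate_mb)
    {
        total_mb += segment_mb;
        nsegments++;
        at_this_size++;
        printf("segment %d: %dMB (running total %dMB)\n",
               nsegments, segment_mb, total_mb);

        /* Double every segment until past 25% of the estimate, then only
         * every second segment (DSA_NUM_SEGMENTS_AT_EACH_SIZE == 2). */
        if (total_mb <= estimate_mb / 4 || at_this_size == 2)
        {
            segment_mb *= 2;
            at_this_size = 0;
        }
    }
    printf("%d segments for a %dMB estimate\n", nsegments, estimate_mb);
    return 0;
}

For a 1GB estimate that prints the 16MB + 32MB + 64MB + 128MB + 256MB +
256MB + 512MB sequence: 7 segments, as above.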

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#2)
Re: Tweaking DSM and DSA limits

Hi,

On 2019-06-20 14:20:27 -0400, Robert Haas wrote:

On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).

I am not convinced that we really need to GUC-ify this. How about
just bumping the value up from 2 to say 5?

I'm not sure either. Although it's not great if the only way out for a
user hitting this is to increase max_connections... But we should really
increase the default.

As Andres observed off-list, it would also be a good idea to allow
things that are going to gobble memory like hash joins to have some
input into how much memory gets allocated. Maybe preallocating the
expected size of the hash is too aggressive -- estimates can be wrong,
and it could be much smaller.

At least for the case of the hashtable itself, we allocate that at the
predicted size immediately. So a mis-estimation wouldn't change
anything. For the entries, yea, something like you suggest would make
sense.

Greetings,

Andres Freund

#4 David Fetter
david@fetter.org
In reply to: Robert Haas (#2)
Re: Tweaking DSM and DSA limits

On Thu, Jun 20, 2019 at 02:20:27PM -0400, Robert Haas wrote:

On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:

It's currently set to 4, but I now think that was too cautious. It
tries to avoid fragmentation (that is, memory allocated, and in some
cases committed by the operating system, that we don't turn out to
need) by ramping up slowly, but it's pretty wasteful of slots. Perhaps
it should be set to 2?

+1. I think I said at the time that I thought that was too cautious...

Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).

I am not convinced that we really need to GUC-ify this. How about
just bumping the value up from 2 to say 5? Between the preceding
change and this one we ought to buy ourselves more than 4x, and if
that is not enough then we can ask whether raising max_connections is
a reasonable workaround,

Is there perhaps a way to make raising max_connections not require a
restart? There are plenty of situations out there where restarts
aren't something that can be done on a whim.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#5 Robert Haas
robertmhaas@gmail.com
In reply to: David Fetter (#4)
Re: Tweaking DSM and DSA limits

On Thu, Jun 20, 2019 at 5:00 PM David Fetter <david@fetter.org> wrote:

Is there perhaps a way to make raising max_connections not require a
restart? There are plenty of situations out there where restarts
aren't something that can be done on a whim.

Sure, if you want to make this take about 100x more work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#3)
2 attachment(s)
Re: Tweaking DSM and DSA limits

On Fri, Jun 21, 2019 at 6:52 AM Andres Freund <andres@anarazel.de> wrote:

On 2019-06-20 14:20:27 -0400, Robert Haas wrote:

On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).

I am not convinced that we really need to GUC-ify this. How about
just bumping the value up from 2 to say 5?

I'm not sure either. Although it's not great if the only way out for a
user hitting this is to increase max_connections... But we should really
increase the default.

Ok, hard-to-explain GUC abandoned. Here is a patch that just adjusts
the two constants. DSM's array allows for 5 slots per connection (up
from 2), and DSA doubles its size after every two segments (down from
4).

As Andres observed off-list, it would also be a good idea to allow
things that are going to gobble memory like hash joins to have some
input into how much memory gets allocated. Maybe preallocating the
expected size of the hash is too aggressive -- estimates can be wrong,
and it could be much smaller.

At least for the case of the hashtable itself, we allocate that at the
predicted size immediately. So a mis-estimation wouldn't change
anything. For the entries, yea, something like you suggest would make
sense.

At the moment the 32KB chunks are used as parallel granules for
various work (inserting, repartitioning, rebucketing). I could
certainly allocate a much bigger piece based on estimates, and then
invent another kind of chunks inside that, or keep the existing
layering but find a way to hint to DSA what allocation stream to
expect in the future so it can get bigger underlying chunks ready.
One problem is that it'd result in large, odd sized memory segments,
whereas the current scheme uses power of two sizes and might be more
amenable to a later segment reuse scheme; or maybe that doesn't really
matter.

I have a long wish list of improvements I'd like to investigate in
this area, subject for future emails, but while I'm making small
tweaks, here's another small thing: there is no "wait event" while
allocating (in the kernel sense) POSIX shm on Linux, unlike the
equivalent IO when file-backed segments are filled with write() calls.
Let's just reuse the same wait event, so that you can see what's going
on in pg_stat_activity.

Attachments:

0001-Adjust-the-constants-used-to-reserve-DSM-segment-slo.patch
From e73ed90730354e2f714cad1e4c226178b1361fb1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 21 Oct 2019 10:41:23 +1300
Subject: [PATCH 1/2] Adjust the constants used to reserve DSM segment slots.

In some scenarios (probably only artificial tests), it is possible to run
out of DSM segment slots by running a lot of parallel queries concurrently.
Adjust the constants so that it's more difficult to hit the hard coded
limit.

Previously, a DSA area would create up to four segments at each size
before doubling the size.  After this commit, it will create only two at
each size, before switching to larger segment sizes.

Previously, the total limit on DSM slots allowed for 2 per connection.
Switch to 5 per connection.

Remove an obsolete nearby comment.

Author: Thomas Munro (based on an idea from Andres Freund to introduce a GUC)
Reviewed-by: Robert Haas (who suggested simply adjusting the numbers for now)
Discussion: https://postgr.es/m/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com
---
 src/backend/storage/ipc/dsm.c | 7 +------
 src/backend/utils/mmgr/dsa.c  | 2 +-
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 142293fd2c..641086cc04 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -45,13 +45,8 @@
 
 #define PG_DYNSHMEM_CONTROL_MAGIC		0x9a503d32
 
-/*
- * There's no point in getting too cheap here, because the minimum allocation
- * is one OS page, which is probably at least 4KB and could easily be as high
- * as 64KB.  Each currently sizeof(dsm_control_item), currently 8 bytes.
- */
 #define PG_DYNSHMEM_FIXED_SLOTS			64
-#define PG_DYNSHMEM_SLOTS_PER_BACKEND	2
+#define PG_DYNSHMEM_SLOTS_PER_BACKEND	5
 
 #define INVALID_CONTROL_SLOT		((uint32) -1)
 
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 6590e55a24..8225e56e82 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -74,7 +74,7 @@
  * dsm.c's limits on total number of segments), or limiting the total size
  * an area can manage when using small pointers.
  */
-#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4
+#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 2
 
 /*
  * The number of bits used to represent the offset part of a dsa_pointer.
-- 
2.20.1

0002-Report-time-spent-in-posix_fallocate-as-a-wait-event.patch
From 973b3078b7f621556a692b6fdc397b5d9ac4c3ee Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 21 Oct 2019 11:05:53 +1300
Subject: [PATCH 2/2] Report time spent in posix_fallocate() as a wait event.

When allocating DSM segments with posix_fallocate() on Linux (see commit
899bd785), report this activity as a wait event exactly as we would if
we were using file-backed DSM rather than shm_open()-backed DSM.

Author: Thomas Munro
---
 src/backend/storage/ipc/dsm_impl.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 2879b84bf6..031fd47ee5 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -371,10 +371,12 @@ dsm_impl_posix_resize(int fd, off_t size)
 		 * interrupt pending.  This avoids the possibility of looping forever
 		 * if another backend is repeatedly trying to interrupt us.
 		 */
+		pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
 		do
 		{
 			rc = posix_fallocate(fd, 0, size);
 		} while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
+		pgstat_report_wait_end();
 
 		/*
 		 * The caller expects errno to be set, but posix_fallocate() doesn't
-- 
2.20.1

#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#6)
Re: Tweaking DSM and DSA limits

On Mon, Oct 21, 2019 at 12:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Jun 21, 2019 at 6:52 AM Andres Freund <andres@anarazel.de> wrote:

On 2019-06-20 14:20:27 -0400, Robert Haas wrote:

I am not convinced that we really need to GUC-ify this. How about
just bumping the value up from 2 to say 5?

I'm not sure either. Although it's not great if the only way out for a
user hitting this is to increase max_connections... But we should really
increase the default.

Ok, hard-to-explain GUC abandoned. Here is a patch that just adjusts
the two constants. DSM's array allows for 5 slots per connection (up
from 2), and DSA doubles its size after every two segments (down from
4).

Pushed. No back-patch for now: the risk/reward ratio doesn't seem
right for that.

I have a long wish list of improvements I'd like to investigate in
this area, subject for future emails, but while I'm making small
tweaks, here's another small thing: there is no "wait event" while
allocating (in the kernel sense) POSIX shm on Linux, unlike the
equivalent IO when file-backed segments are filled with write() calls.
Let's just reuse the same wait event, so that you can see what's going
on in pg_stat_activity.

Also pushed.