Add GUC to tune glibc's malloc implementation.
Hello,
Following some conversation with Tomas at PGCon, I decided to resurrect this
topic, which was previously discussed in the context of moving tuplesort to
use GenerationContext: https://www.postgresql.org/message-id/8046109.NyiUUSuA9g%40aivenronan
The idea for this patch is that the behaviour of glibc's malloc can be
counterproductive for us in some cases. To summarise, glibc's malloc offers
(among others) two tunable parameters which greatly affect how it allocates
memory. From the mallopt manpage:
M_TRIM_THRESHOLD
When the amount of contiguous free memory at the top of
the heap grows sufficiently large, free(3) employs sbrk(2)
to release this memory back to the system. (This can be
useful in programs that continue to execute for a long
period after freeing a significant amount of memory.)
M_MMAP_THRESHOLD
For allocations greater than or equal to the limit
specified (in bytes) by M_MMAP_THRESHOLD that can't be
satisfied from the free list, the memory-allocation
functions employ mmap(2) instead of increasing the program
break using sbrk(2).
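For context, both knobs can be set explicitly through mallopt(3), which is what
the attached patch ends up doing. A minimal standalone sketch, independent of
the patch:

#include <malloc.h>     /* glibc-specific: mallopt(), M_* constants */
#include <stdlib.h>

int
main(void)
{
    /* Keep up to 64MB of free heap memory instead of returning it to the kernel. */
    mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024);
    /* Serve allocations below 32MB from the heap rather than through mmap(). */
    mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);

    void *p = malloc(16 * 1024 * 1024);

    free(p);    /* stays on malloc's free list: no trim, no munmap() */
    return 0;
}

Per the mallopt(3) man page, explicitly setting either parameter also disables
glibc's dynamic adjustment of both thresholds.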
The thing is, by default, those parameters are adjusted dynamically by the
glibc itself. It starts with quite small thresholds, and raises them when the
program frees some memory, up to a certain limit. This patch proposes a new
GUC allowing the user to adjust those settings according to their workload.
This can cause problems. Let's take for example a table with 10k rows, and 32
columns (as defined by a bench script David Rowley shared last year when
discussing the GenerationContext for tuplesort), and execute the following
query, with 32MB of work_mem:
select * from t order by a offset 100000;
On unpatched master, attaching strace to the backend and grepping on brk|mmap,
we get the following syscalls:
brk(0x55b00df0c000) = 0x55b00df0c000
brk(0x55b00df05000) = 0x55b00df05000
brk(0x55b00df28000) = 0x55b00df28000
brk(0x55b00df52000) = 0x55b00df52000
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbc49254000
brk(0x55b00df7e000) = 0x55b00df7e000
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbc48f7f000
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbc48e7e000
mmap(NULL, 200704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbc4980f000
brk(0x55b00df72000) = 0x55b00df72000
mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbc3d56d000
Using systemtap, we can hook into glibc malloc's static probes to log whenever
it adjusts its values. During the above query, glibc's malloc raised its
thresholds:
347704: New thresholds: mmap: 2101248 bytes, trim: 4202496 bytes
If we re-run the query again, we get:
brk(0x55b00dfe2000) = 0x55b00dfe2000
brk(0x55b00e042000) = 0x55b00e042000
brk(0x55b00e0ce000) = 0x55b00e0ce000
brk(0x55b00e1e6000) = 0x55b00e1e6000
brk(0x55b00e216000) = 0x55b00e216000
brk(0x55b00e416000) = 0x55b00e416000
brk(0x55b00e476000) = 0x55b00e476000
brk(0x55b00dfbc000) = 0x55b00dfbc000
This time, our allocations are below the new mmap_threshold, so malloc gets us
our memory by repeatedly moving the brk pointer.
When running with the attached patch, and setting the new GUC:
set glibc_malloc_max_trim_threshold = '64MB';
We now get the following syscalls for the same query, for the first run:
brk(0x55b00df0c000) = 0x55b00df0c000
brk(0x55b00df2e000) = 0x55b00df2e000
brk(0x55b00df52000) = 0x55b00df52000
brk(0x55b00dfb2000) = 0x55b00dfb2000
brk(0x55b00e03e000) = 0x55b00e03e000
brk(0x55b00e156000) = 0x55b00e156000
brk(0x55b00e186000) = 0x55b00e186000
brk(0x55b00e386000) = 0x55b00e386000
brk(0x55b00e3e6000) = 0x55b00e3e6000
But for the second run, the memory allocated is kept in malloc's freelist
instead of being released to the kernel, generating no syscalls at all. This
brings us a significant performance improvement (up to twice the tps) at the
cost of more memory being retained by the idle backend.
On the other hand, the default behaviour can also be a problem if a backend
makes big allocations for a short time and then never needs that amount of
memory again.
For example, running this query:
select * from generate_series(1, 1000000);
We allocate some memory. The first time it's run, malloc will use mmap to
satisfy it. Once it's freed, it will raise its threshold, and a second run
will allocate it on the heap instead. So if we run the query twice, we end up
with some memory in malloc's free lists that we may never use again. Using the
new GUC, we can actually control whether it will be given back to the OS by
setting a small value for the threshold.
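As a side note, this is also what the patch relies on when the threshold is
lowered: glibc provides malloc_trim(3) to explicitly hand free heap memory back
to the kernel. A minimal glibc-specific sketch, not taken from the patch itself:

#include <malloc.h>     /* glibc: mallopt(), malloc_trim() */

static void
release_free_heap_memory(void)
{
    /* Make future free()s release memory back to the kernel eagerly... */
    mallopt(M_TRIM_THRESHOLD, 128 * 1024);
    /* ...and trim whatever is already sitting unused at the top of the heap
     * (0 = keep no extra padding around). */
    malloc_trim(0);
}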
I attached the results of the 10k rows / 32 columns / 32MB work_mem benchmark
with different values for glibc_malloc_max_trim_threshold.
I don't know how to write a test for this new feature so let me know if you
have suggestions. Documentation is not written yet, as I expect discussion on
this thread to lead to significant changes to the user-visible GUC or GUCs:
- should we provide one for trim which also adjusts mmap_threshold (current
patch), or several GUCs?
- should this be simplified to only offer the default behaviour (glibc takes
care of the thresholds) and some presets ("greedy", to set trim_threshold to
work_mem, "frugal" to set it to a really small value)? A rough sketch of what
such presets could look like is included below.
Best regards,
--
Ronan Dunklau
Attachments:
v1-0001-Add-options-to-tune-malloc.patch
From 3686e660446facfb2d64683286176887914cd9fd Mon Sep 17 00:00:00 2001
From: Ronan Dunklau <ronan.dunklau@aiven.io>
Date: Tue, 20 Jun 2023 16:17:32 +0200
Subject: [PATCH v1] Add options to tune malloc.
Add a new GUC glibc_malloc_max_trim_threshold which is used only when
compiled against glibc.
This GUC adjusts glibc malloc's options M_MMAP_THRESHOLD and M_TRIM_THRESHOLD.
M_MMAP_THRESHOLD will be set to half of M_TRIM_THRESHOLD. We cap the
value to the current work_mem, as it doesn't make much sense to reserve
more memory than that.
This new GUC allows the user to control how aggressively malloc will actually
return memory from the heap to the kernel, and when it should start
using mmap for bigger allocations. Depending on the workload, a user can
reduce the residual footprint of a long session (by reducing the trim
threshold to a fairly low value) or on the contrary make sure that
malloc keeps a sizable free chunk at the end of the heap for future
allocations.
On some benchmarks heavily dependent on memory allocations, like sorts,
this can dramatically influence performance.
---
src/backend/utils/init/postinit.c | 6 +
src/backend/utils/misc/guc_tables.c | 16 ++-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/backend/utils/mmgr/Makefile | 1 +
src/backend/utils/mmgr/malloc_tuning.c | 122 ++++++++++++++++++
src/backend/utils/mmgr/meson.build | 1 +
src/include/utils/guc_hooks.h | 2 +
src/include/utils/memutils.h | 10 ++
8 files changed, 158 insertions(+), 1 deletion(-)
create mode 100644 src/backend/utils/mmgr/malloc_tuning.c
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 561bd13ed2..707f20606d 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -808,6 +808,12 @@ InitPostgres(const char *in_dbname, Oid dboid,
InitCatalogCache();
InitPlanCache();
+ /* Adjust malloc options if needed.
+ * This is done here because the implementation can vary depending on the
+ * type of backend.
+ */
+ MallocAdjustSettings();
+
/* Initialize portal manager */
EnablePortalManager();
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 71e27f8eb0..f1e03dc306 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2338,7 +2338,7 @@ struct config_int ConfigureNamesInt[] =
},
&work_mem,
4096, 64, MAX_KILOBYTES,
- NULL, NULL, NULL
+ NULL, &assign_work_mem, NULL
},
{
@@ -2364,6 +2364,20 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"glibc_malloc_max_trim_threshold", PGC_USERSET, RESOURCES_MEM,
+ gettext_noop("Sets the maximum value for glibc's M_TRIM_THRESHOLD option."),
+ gettext_noop("This controls how much memory glibc's will not return to the "
+ "OS once freed. An idle backend can thus consume that much memory "
+ "even if not in used. The default (-1) value disable static tuning "
+ "and relies on the default dynamic adjustment"),
+ GUC_UNIT_KB
+ },
+ &glibc_malloc_max_trim_threshold,
+ -1, -1, MAX_KILOBYTES,
+ NULL, &assign_glibc_trim_threshold, NULL
+ },
+
/*
* We use the hopefully-safely-small value of 100kB as the compiled-in
* default for max_stack_depth. InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e4c0269fa3..dc70a1dfa3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -157,6 +157,7 @@
# windows
# mmap
# (change requires restart)
+#glibc_malloc_max_trim_threshold = -1 # Only used with glibc's malloc. -1 disables it
#min_dynamic_shared_memory = 0MB # (change requires restart)
#vacuum_buffer_usage_limit = 256kB # size of vacuum and analyze buffer access strategy ring;
# 0 to disable vacuum buffer access strategy;
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index dae3432c98..88c5f7e23a 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -18,6 +18,7 @@ OBJS = \
dsa.o \
freepage.o \
generation.o \
+ malloc_tuning.o \
mcxt.o \
memdebug.o \
portalmem.o \
diff --git a/src/backend/utils/mmgr/malloc_tuning.c b/src/backend/utils/mmgr/malloc_tuning.c
new file mode 100644
index 0000000000..935420e74f
--- /dev/null
+++ b/src/backend/utils/mmgr/malloc_tuning.c
@@ -0,0 +1,122 @@
+#include "postgres.h"
+
+#include "utils/guc_hooks.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
+
+
+/*
+ * Implementation-specific GUCs. Those are defined even if we use another implementation, but will have
+ * no effect in that case.
+ */
+
+int glibc_malloc_max_trim_threshold;
+
+/*
+ * Depending on the malloc implementation used, we may want to
+ * tune it.
+ * In this first version, the only tunable library is glibc's malloc
+ * implementation.
+ */
+/* GLIBC implementation */
+
+void
+assign_work_mem(int newval, void* extra)
+{
+ work_mem = newval;
+ MallocAdjustSettings();
+}
+
+void
+assign_glibc_trim_threshold(int newval, void* extra)
+{
+ glibc_malloc_max_trim_threshold = newval;
+ MallocAdjustSettings();
+}
+
+#if defined(__GLIBC__)
+#include <malloc.h>
+#include <limits.h>
+#include <bits/wordsize.h>
+
+int previous_mmap_threshold = -1;
+int previous_trim_threshold = -1;
+
+/* For GLIBC's malloc, we want to avoid having too many mmap'd memory regions,
+ * and also to avoid repeatedly allocating / releasing memory from the system.
+ *
+ * The default configuration of malloc adapts its M_MMAP_THRESHOLD and M_TRIM_THRESHOLD
+ * dynamically when previously mmaped blocks are freed.
+ *
+ * This isn't really sufficient for our use case, as we may end up with a trim
+ * threshold which repeatedly releases work_mem memory to the system.
+ *
+ * Instead of letting malloc dynamically tune itself, for values up to 32MB we
+ * ensure that work_mem will fit both below M_MMAP_THRESHOLD and
+ * M_TRIM_THRESHOLD. The rationale for this is that once a backend has allocated
+ * this much memory, it is likely to use it again.
+ *
+ * To keep up with malloc's default behaviour, we set M_TRIM_THRESHOLD to
+ * M_MMAP_THRESHOLD * 2 so that work_mem blocks can avoid being released too
+ * early.
+ *
+ * Newer versions of malloc got rid of the MALLOC_MMAP_THRESHOLD upper limit,
+ * but we still enforce the values it sets to avoid wasting too much memory if we have a huge
+ * work_mem which is used only once.
+ */
+
+# if __WORDSIZE == 32
+# define MMAP_THRESHOLD_MAX (512 * 1024)
+# else
+# define MMAP_THRESHOLD_MAX (4 * 1024 * 1024 * sizeof(long))
+# endif
+
+void
+MallocAdjustSettings()
+{
+ int uncapped_mmap_threshold,
+ mmap_threshold,
+ trim_threshold;
+ long base_target;
+ /* If static malloc tuning is disabled, bail out. */
+ if (glibc_malloc_max_trim_threshold == -1)
+ return;
+ /* We don't want to adjust anything in the postmaster process, as that would
+ * disable dynamic adjustment for any child process */
+ if ((MyProcPid == PostmasterPid) ||
+ ((MyBackendType != B_BACKEND) &&
+ (MyBackendType != B_BG_WORKER)))
+ return;
+ base_target = Min((long) work_mem * 1024, (long) glibc_malloc_max_trim_threshold / 2 * 1024);
+ /* To account for overhead, add one more memory page to that. */
+ base_target += 4096;
+ uncapped_mmap_threshold = Min(INT_MAX, base_target);
+ /* Cap mmap_threshold to MMAP_THRESHOLD_MAX */
+ mmap_threshold = Min(MMAP_THRESHOLD_MAX, uncapped_mmap_threshold);
+ /* Set trim threshold to two times the uncapped value */
+ trim_threshold = Min(INT_MAX, (long) uncapped_mmap_threshold * 2);
+ if (mmap_threshold != previous_mmap_threshold)
+ {
+ mallopt(M_MMAP_THRESHOLD, mmap_threshold);
+ previous_mmap_threshold = mmap_threshold;
+ }
+
+ if (trim_threshold != previous_trim_threshold)
+ {
+ mallopt(M_TRIM_THRESHOLD, trim_threshold);
+ /* If we reduce the trim_threshold, ask malloc to actually trim it.
+ * This allows us to recover from a bigger work_mem set up once, then
+ * reset back to a smaller value.
+ */
+ if (trim_threshold < previous_trim_threshold)
+ malloc_trim(trim_threshold);
+ previous_trim_threshold = trim_threshold;
+ }
+}
+
+/* Default no-op implementation for other malloc providers. */
+#else
+void MallocAdjustSettings()
+{
+}
+#endif
diff --git a/src/backend/utils/mmgr/meson.build b/src/backend/utils/mmgr/meson.build
index e6ebe145ea..53c1c85524 100644
--- a/src/backend/utils/mmgr/meson.build
+++ b/src/backend/utils/mmgr/meson.build
@@ -6,6 +6,7 @@ backend_sources += files(
'dsa.c',
'freepage.c',
'generation.c',
+ 'malloc_tuning.c',
'mcxt.c',
'memdebug.c',
'portalmem.c',
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 2ecb9fc086..e3644d4323 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -59,6 +59,7 @@ extern bool check_default_with_oids(bool *newval, void **extra,
GucSource source);
extern bool check_effective_io_concurrency(int *newval, void **extra,
GucSource source);
+extern void assign_glibc_trim_threshold(int newval, void* extra);
extern bool check_huge_page_size(int *newval, void **extra, GucSource source);
extern const char *show_in_hot_standby(void);
extern bool check_locale_messages(char **newval, void **extra, GucSource source);
@@ -157,6 +158,7 @@ extern bool check_wal_consistency_checking(char **newval, void **extra,
GucSource source);
extern void assign_wal_consistency_checking(const char *newval, void *extra);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern void assign_work_mem(int newval, void *extra);
extern bool check_io_direct(char **newval, void **extra, GucSource source);
extern void assign_io_direct(const char *newval, void *extra);
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index 21640d62a6..b502c1b0ff 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -93,6 +93,16 @@ extern void MemoryContextStatsDetail(MemoryContext context, int max_children,
extern void MemoryContextAllowInCriticalSection(MemoryContext context,
bool allow);
+/* Interface to tune underlying malloc implementation in mcxt.c.
+ * The implementation can only rely on GUCs for now, but it could be profitable
+ * to collect statistics about individual palloc / pfree cycle to determine the
+ * optimum size of certain values.
+ */
+extern void MallocAdjustSettings(void);
+
+/* Malloc-implementation specific GUCs */
+extern PGDLLIMPORT int glibc_malloc_max_trim_threshold;
+
#ifdef MEMORY_CONTEXT_CHECKING
extern void MemoryContextCheck(MemoryContext context);
#endif
--
2.41.0
Ronan Dunklau <ronan.dunklau@aiven.io> writes:
Following some conversation with Tomas at PGCon, I decided to resurrect this
topic, which was previously discussed in the context of moving tuplesort to
use GenerationContext: https://www.postgresql.org/message-id/8046109.NyiUUSuA9g%40aivenronan
This seems like a pretty awful idea, mainly because there's no way
to have such a GUC mean anything on non-glibc platforms, which is
going to cause confusion or worse.
Aren't these same settings controllable via environment variables?
I could see adding some docs suggesting that you set thus-and-such
values in the postmaster's startup script. Admittedly, the confusion
argument is perhaps still raisable; but we have a similar docs section
discussing controlling Linux OOM behavior, and I've not heard much
complaints about that.
regards, tom lane
Le jeudi 22 juin 2023, 15:49:36 CEST Tom Lane a écrit :
This seems like a pretty awful idea, mainly because there's no way
to have such a GUC mean anything on non-glibc platforms, which is
going to cause confusion or worse.
I named the GUC glibc_malloc_max_trim_threshold; I hope this is enough to
clear up the confusion. We already have at least event_source, which is
Windows-specific even if it's not clear from the name.
Aren't these same settings controllable via environment variables?
I could see adding some docs suggesting that you set thus-and-such
values in the postmaster's startup script. Admittedly, the confusion
argument is perhaps still raisable; but we have a similar docs section
discussing controlling Linux OOM behavior, and I've not heard much
complaints about that.
Yes they are, but controlling them via an environment variable for the whole
cluster defeats the point: different backends have different workloads, and
being able to make sure for example the OLAP user is memory-greedy while the
OLTP one is as conservative as possible is a worthwhile goal. Or even a
specific backend may want to raise its work_mem and adapt glibc behaviour
accordingly, then get back to being conservative with memory until the next
such transaction.
Regards,
--
Ronan Dunklau
Ronan Dunklau <ronan.dunklau@aiven.io> writes:
Le jeudi 22 juin 2023, 15:49:36 CEST Tom Lane a écrit :
Aren't these same settings controllable via environment variables?
I could see adding some docs suggesting that you set thus-and-such
values in the postmaster's startup script. Admittedly, the confusion
argument is perhaps still raisable; but we have a similar docs section
discussing controlling Linux OOM behavior, and I've not heard much
complaints about that.
Yes they are, but controlling them via an environment variable for the whole
cluster defeats the point: different backends have different workloads, and
being able to make sure for example the OLAP user is memory-greedy while the
OLTP one is as conservative as possible is a worthwhile goal.
And what is going to happen when we switch to a thread model?
(I don't personally think that's going to happen, but some other
people do.) If we only document how to adjust this cluster-wide,
then we won't have a problem with that. But I'm not excited about
introducing functionality that is both platform-dependent and
unsupportable in a threaded system.
regards, tom lane
On 22.06.23 15:35, Ronan Dunklau wrote:
The thing is, by default, those parameters are adjusted dynamically by the
glibc itself. It starts with quite small thresholds, and raises them when the
program frees some memory, up to a certain limit. This patch proposes a new
GUC allowing the user to adjust those settings according to their workload.
This can cause problems. Let's take for example a table with 10k rows, and 32
columns (as defined by a bench script David Rowley shared last year when
discussing the GenerationContext for tuplesort), and execute the following
query, with 32MB of work_mem:
I don't follow what you are trying to achieve with this. The examples
you show appear to work sensibly in my mind. Using this setting, you
can save some of the adjustments that glibc does after the first query.
But that seems only useful if your session only does one query. Is that
what you are doing?
Le vendredi 23 juin 2023, 22:55:51 CEST Peter Eisentraut a écrit :
On 22.06.23 15:35, Ronan Dunklau wrote:
The thing is, by default, those parameters are adjusted dynamically by the
glibc itself. It starts with quite small thresholds, and raises them when
the program frees some memory, up to a certain limit. This patch proposes
a new GUC allowing the user to adjust those settings according to their
workload.
This can cause problems. Let's take for example a table with 10k rows, and
32 columns (as defined by a bench script David Rowley shared last year
when discussing the GenerationContext for tuplesort), and execute the
following
query, with 32MB of work_mem:
I don't follow what you are trying to achieve with this. The examples
you show appear to work sensibly in my mind. Using this setting, you
can save some of the adjustments that glibc does after the first query.
But that seems only useful if your session only does one query. Is that
what you are doing?
No, not at all: glibc does not do the right thing, we don't "save" it.
I will try to rephrase that.
In the first test case I showed, we see that glibc adjusts its threshold, but
to a suboptimal value since repeated executions of a query needing the same
amount of memory will release it back to the kernel, and move the brk pointer
again, and will not adjust it again. On the other hand, by manually adjusting
the thresholds, we can set them to a higher value which means that the memory
will be kept in malloc's freelist for reuse for the next queries. As shown in
the benchmark results I posted, this can have quite a dramatic effect, going
from 396 tps to 894. For ease of benchmarking, it is a single query being
executed over and over again, but the same thing would be true if different
queries allocating memory were executed by a single backend.
The worst part of this means it is unpredictable: depending on past memory
allocation patterns, glibc will end up in different states, and exhibit
completely different performance for all subsequent queries. In fact, this is
what Tomas noticed last year (see [0]), which led to the investigation into
this.
I also tried to show that in certain cases glibc's behaviour can, on the
contrary, be too greedy, and hold on to too much memory if we just need the
memory once and never allocate it again.
I hope what I'm trying to achieve is clearer that way. Maybe this patch is not
the best way to go about this, but since the memory allocator behaviour can
have such an impact it's a bit sad we have to leave half the performance on
the table because of it when there are easily accessible knobs to avoid it.
[0]: /messages/by-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a@enterprisedb.com
Hi,
On 2023-06-22 15:35:12 +0200, Ronan Dunklau wrote:
This can cause problems. Let's take for example a table with 10k rows, and 32
columns (as defined by a bench script David Rowley shared last year when
discussing the GenerationContext for tuplesort), and execute the following
query, with 32MB of work_mem:
select * from t order by a offset 100000;
I attached the results of the 10k rows / 32 columns / 32MB work_mem benchmark
with different values for glibc_malloc_max_trim_threshold.
Could you provide instructions for the benchmark that don't require digging
into the archive to find an email by David?
Greetings,
Andres Freund
Hi,
On 2023-06-26 08:38:35 +0200, Ronan Dunklau wrote:
I hope what I'm trying to achieve is clearer that way. Maybe this patch is not
the best way to go about this, but since the memory allocator behaviour can
have such an impact it's a bit sad we have to leave half the performance on
the table because of it when there are easily accessible knobs to avoid it.
I'm *quite* doubtful this patch is the way to go. If we want to more tightly
control memory allocation patterns, because we have more information than
glibc, we should do that, rather than try to nudge glibc's malloc in random
directions. In contrast to a generic malloc() implementation, we can have much
more information about memory lifetimes etc. due to memory contexts.
We e.g. could keep a larger number of memory blocks reserved
ourselves. Possibly by delaying the release of additionally held blocks until
we have been idle for a few seconds or such.
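A very rough sketch of that idea, purely hypothetical (the names, sizes and
idle-timeout policy are made up for illustration; this is not proposed code):

#include <stdlib.h>
#include <time.h>

#define BLOCK_CACHE_SLOTS 16

typedef struct CachedBlock
{
    void   *ptr;
    size_t  size;
} CachedBlock;

static CachedBlock block_cache[BLOCK_CACHE_SLOTS];
static time_t last_activity;

/* Try to satisfy a block request from the cache before falling back to malloc(). */
static void *
get_block(size_t size)
{
    last_activity = time(NULL);
    for (int i = 0; i < BLOCK_CACHE_SLOTS; i++)
    {
        if (block_cache[i].ptr != NULL && block_cache[i].size >= size)
        {
            void *ptr = block_cache[i].ptr;

            block_cache[i].ptr = NULL;
            return ptr;
        }
    }
    return malloc(size);
}

/* Instead of free()ing a block immediately, park it for later reuse. */
static void
put_block(void *ptr, size_t size)
{
    last_activity = time(NULL);
    for (int i = 0; i < BLOCK_CACHE_SLOTS; i++)
    {
        if (block_cache[i].ptr == NULL)
        {
            block_cache[i].ptr = ptr;
            block_cache[i].size = size;
            return;
        }
    }
    free(ptr);                  /* cache full: give it back to malloc */
}

/* Called from some idle/periodic hook: drop the cached blocks after a few idle seconds. */
static void
release_cache_if_idle(int idle_seconds)
{
    if (time(NULL) - last_activity < idle_seconds)
        return;
    for (int i = 0; i < BLOCK_CACHE_SLOTS; i++)
    {
        free(block_cache[i].ptr);   /* free(NULL) is a no-op */
        block_cache[i].ptr = NULL;
        block_cache[i].size = 0;
    }
}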
WRT to the difference in TPS in the benchmark you mention - I suspect that we
are doing something bad that needs to be improved regardless of the underlying
memory allocator implementation. Due to the lack of detailed instructions I
couldn't reproduce the results immediately.
Greetings,
Andres Freund
Le lundi 26 juin 2023, 23:03:48 CEST Andres Freund a écrit :
Hi,
On 2023-06-26 08:38:35 +0200, Ronan Dunklau wrote:
I hope what I'm trying to achieve is clearer that way. Maybe this patch is
not the best way to go about this, but since the memory allocator
behaviour can have such an impact it's a bit sad we have to leave half
the performance on the table because of it when there are easily
accessible knobs to avoid it.
I'm *quite* doubtful this patch is the way to go. If we want to more
tightly control memory allocation patterns, because we have more
information than glibc, we should do that, rather than try to nudge glibc's
malloc in random directions. In contrast to a generic malloc() implementation,
we can have much more information about memory lifetimes etc. due to memory
contexts.
Yes this is probably much more appropriate, but a much larger change with
greater risks of regression. Especially as we have to make sure we're not
overfitting our own code for a specific malloc implementation, to the detriment
of others. Unless you were hinting we should write our own directly instead?
We e.g. could keep a larger number of memory blocks reserved
ourselves. Possibly by delaying the release of additionally held blocks
until we have been idle for a few seconds or such.
I think keeping work_mem around after it has been used a couple of times makes
sense. This is the memory a user is willing to dedicate to operations, after
all.
WRT to the difference in TPS in the benchmark you mention - I suspect that
we are doing something bad that needs to be improved regardless of the
underlying memory allocator implementation. Due to the lack of detailed
instructions I couldn't reproduce the results immediately.
I re-attached the simple script I used. I've run this script with different
values for glibc_malloc_max_trim_threshold.
Best regards,
--
Ronan Dunklau
Le mardi 27 juin 2023, 08:35:28 CEST Ronan Dunklau a écrit :
I re-attached the simple script I used. I've run this script with different
values for glibc_malloc_max_trim_threshold.
I forgot to add that it was using default parameters except for work_mem, set
to 32M, and max_parallel_workers_per_gather set to zero.
Hi,
On 2023-06-27 08:35:28 +0200, Ronan Dunklau wrote:
Le lundi 26 juin 2023, 23:03:48 CEST Andres Freund a écrit :
Hi,
On 2023-06-26 08:38:35 +0200, Ronan Dunklau wrote:
I hope what I'm trying to achieve is clearer that way. Maybe this patch is
not the best way to go about this, but since the memory allocator
behaviour can have such an impact it's a bit sad we have to leave half
the performance on the table because of it when there are easily
accessible knobs to avoid it.
I'm *quite* doubtful this patch is the way to go. If we want to more
tightly control memory allocation patterns, because we have more
information than glibc, we should do that, rather than try to nudge glibc's
malloc in random directions. In contrast to a generic malloc() implementation,
we can have much more information about memory lifetimes etc. due to memory
contexts.
Yes this is probably much more appropriate, but a much larger change with
greater risks of regression. Especially as we have to make sure we're not
overfitting our own code for a specific malloc implementation, to the detriment
of others.
I think your approach is fundamentally overfitting our code to a specific
malloc implementation, in a way that's not tunable by mere mortals. It just
seems like a dead end to me.
Unless you were hinting we should write our own directly instead?
I don't think we should write our own malloc - we don't rely on it much
ourselves. And if we replace it, we need to care about malloc's performance
characteristics a whole lot, because various libraries etc do heavily rely on
it.
However, I do think we should eventually avoid using malloc() for aset.c et
al. malloc() is a general allocator, but at least for allocations below
maxBlockSize aset.c doesn't do allocations in a way that really benefits from
that *at all*. It's not a lot of work to do such allocations on our own.
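For illustration only, a hypothetical sketch of what allocating context blocks
without going through malloc() could look like (this is not actual aset.c code):

#include <stddef.h>
#include <sys/mman.h>

/* Grab one memory-context block straight from the kernel. */
static void *
block_alloc(size_t size)
{
    void *block = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return (block == MAP_FAILED) ? NULL : block;
}

/* Return the block to the kernel directly; no allocator free list involved. */
static void
block_free(void *block, size_t size)
{
    munmap(block, size);
}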
We e.g. could keep a larger number of memory blocks reserved
ourselves. Possibly by delaying the release of additionally held blocks
until we have been idle for a few seconds or such.
I think keeping work_mem around after it has been used a couple of times makes
sense. This is the memory a user is willing to dedicate to operations, after
all.
The biggest overhead of returning pages to the kernel is that that triggers
zeroing the data during the next allocation. Particularly on multi-node
servers that's surprisingly slow. It's most commonly not the brk() or mmap()
themselves that are the performance issue.
Indeed, with your benchmark, I see that most of the time, on my dual Xeon Gold
5215 workstation, is spent zeroing newly allocated pages during page
faults. That microarchitecture is worse at this than some others, but it's
never free (or cache friendly).
WRT to the difference in TPS in the benchmark you mention - I suspect that
we are doing something bad that needs to be improved regardless of the
underlying memory allocator implementation. Due to the lack of detailed
instructions I couldn't reproduce the results immediately.
I re-attached the simple script I used. I've run this script with different
values for glibc_malloc_max_trim_threshold.
FWIW, in my experience trimming the brk()ed region doesn't work reliably
enough in real world postgres workloads to be worth relying on (from a memory
usage POV). Sooner or later you're going to have longer lived allocations
placed that will prevent it from happening.
I have played around with telling aset.c that certain contexts are long lived
and using mmap() for those, to make it more likely that the libc malloc/free
can actually return memory to the system. I think that can be quite
worthwhile.
Greetings,
Andres Freund
Le mardi 27 juin 2023, 20:17:46 CEST Andres Freund a écrit :
Yes this is probably much more appropriate, but a much larger change with
greater risks of regression. Especially as we have to make sure we're not
overfitting our own code for a specific malloc implementation, to the
detriment of others.
I think your approach is fundamentally overfitting our code to a specific
malloc implementation, in a way that's not tunable by mere mortals. It just
seems like a dead end to me.
I see it as a way to have *some* sort of control over the malloc
implementation we use, instead of tuning our allocations pattern on top of it
while treating it entirely as a black box. As for the tuning, I proposed
earlier to replace this parameter expressed in terms of size as a "profile"
(greedy / conservative) to make it easier to pick a sensible value.
Unless you were hinting we should write our own directly instead?
I don't think we should write our own malloc - we don't rely on it much
ourselves. And if we replace it, we need to care about malloc's performance
characteristics a whole lot, because various libraries etc do heavily rely
on it.
However, I do think we should eventually avoid using malloc() for aset.c et
al. malloc() is a general allocator, but at least for allocations below
maxBlockSize aset.c doesn't do allocations in a way that really benefits
from that *at all*. It's not a lot of work to do such allocations on our
own.
We e.g. could keep a larger number of memory blocks reserved
ourselves. Possibly by delaying the release of additionally held blocks
until we have been idle for a few seconds or such.
I think keeping work_mem around after it has been used a couple of times makes
sense. This is the memory a user is willing to dedicate to operations,
after all.
The biggest overhead of returning pages to the kernel is that that triggers
zeroing the data during the next allocation. Particularly on multi-node
servers that's surprisingly slow. It's most commonly not the brk() or
mmap() themselves that are the performance issue.
Indeed, with your benchmark, I see that most of the time, on my dual Xeon
Gold 5215 workstation, is spent zeroing newly allocated pages during page
faults. That microarchitecture is worse at this than some others, but it's
never free (or cache friendly).
I'm not sure I see the practical difference between those, but that's
interesting. Were you able to reproduce my results?
FWIW, in my experience trimming the brk()ed region doesn't work reliably
enough in real world postgres workloads to be worth relying on (from a
memory usage POV). Sooner or later you're going to have longer lived
allocations placed that will prevent it from happening.
I'm not sure I follow: given our workload is clearly split at queries and
transactions boundaries, releasing memory at that time, I've assumed (and
noticed in practice, albeit not on a production system) that most memory at
the top of the heap would be trimmable as we don't keep much in between
queries / transactions.
I have played around with telling aset.c that certain contexts are long
lived and using mmap() for those, to make it more likely that the libc
malloc/free can actually return memory to the system. I think that can be
quite worthwhile.
So if I understand your different suggestions, we should:
- use mmap ourselves for what we deem to be "one-off" allocations, to make
sure that memory is not hanging around after we no longer need it
- keep some pool allocated which will not be freed in between queries, but
reused for the next time we need it.
Thank you for looking at this problem.
Regards,
--
Ronan Dunklau
Hi,
On 2023-06-28 07:26:03 +0200, Ronan Dunklau wrote:
I see it as a way to have *some* sort of control over the malloc
implementation we use, instead of tuning our allocations pattern on top of it
while treating it entirely as a black box. As for the tuning, I proposed
earlier to replace this parameter expressed in terms of size as a "profile"
(greedy / conservative) to make it easier to pick a sensible value.
I don't think that makes it very usable - we'll still have idle connections
use up a lot more memory than now in some cases, and not in others, even
though it doesn't help. And it will be very heavily dependent on the OS and
glibc version.
Le mardi 27 juin 2023, 20:17:46 CEST Andres Freund a écrit :
Unless you were hinting we should write our own directly instead?
We e.g. could keep a larger number of memory blocks reserved
ourselves. Possibly by delaying the release of additionally held blocks
until we have been idle for a few seconds or such.
I think keeping work_mem around after it has been used a couple of times makes
sense. This is the memory a user is willing to dedicate to operations,
after all.
The biggest overhead of returning pages to the kernel is that that triggers
zeroing the data during the next allocation. Particularly on multi-node
servers that's surprisingly slow. It's most commonly not the brk() or
mmap() themselves that are the performance issue.
Indeed, with your benchmark, I see that most of the time, on my dual Xeon
Gold 5215 workstation, is spent zeroing newly allocated pages during page
faults. That microarchitecture is worse at this than some others, but it's
never free (or cache friendly).
I'm not sure I see the practical difference between those, but that's
interesting. Were you able to reproduce my results?
I see a somewhat smaller win than what you observed, but it is substantial.
The runtime difference between the "default" and "cached" malloc is almost
entirely in these bits:
cached:
- 8.93% postgres libc.so.6 [.] __memmove_evex_unaligned_erms
- __memmove_evex_unaligned_erms
+ 6.77% minimal_tuple_from_heap_tuple
+ 2.04% _int_realloc
+ 0.04% AllocSetRealloc
0.02% 0x56281094806f
0.02% 0x56281094e0bf
vs
uncached:
- 14.52% postgres libc.so.6 [.] __memmove_evex_unaligned_erms
8.61% asm_exc_page_fault
- 5.91% __memmove_evex_unaligned_erms
+ 5.78% minimal_tuple_from_heap_tuple
0.04% 0x560130a2900f
0.02% 0x560130a20faf
+ 0.02% AllocSetRealloc
+ 0.02% _int_realloc
+ 3.81% postgres [kernel.vmlinux] [k] native_irq_return_iret
+ 1.88% postgres [kernel.vmlinux] [k] __handle_mm_fault
+ 1.76% postgres [kernel.vmlinux] [k] clear_page_erms
+ 1.67% postgres [kernel.vmlinux] [k] get_mem_cgroup_from_mm
+ 1.42% postgres [kernel.vmlinux] [k] cgroup_rstat_updated
+ 1.00% postgres [kernel.vmlinux] [k] get_page_from_freelist
+ 0.93% postgres [kernel.vmlinux] [k] mtree_range_walk
None of the latter are visible in a profile in the cached case.
I.e. the overhead is encountering page faults and individually allocating the
necessary memory in the kernel.
This isn't surprising, I just wanted to make sure I entirely understand.
Part of the reason this code is a bit worse is that it's using generation.c,
which doesn't cache any part of the context. Not that aset.c's level of
caching would help a lot, given that it caches the context itself, not later
blocks.
FWIW, in my experience trimming the brk()ed region doesn't work reliably
enough in real world postgres workloads to be worth relying on (from a
memory usage POV). Sooner or later you're going to have longer lived
allocations placed that will prevent it from happening.
I'm not sure I follow: given our workload is clearly split at queries and
transactions boundaries, releasing memory at that time, I've assumed (and
noticed in practice, albeit not on a production system) that most memory at
the top of the heap would be trimmable as we don't keep much in between
queries / transactions.
That's true for very simple workloads, but once you're beyond that you just
need some longer-lived allocation to happen. E.g. some relcache / catcache
miss during the query execution, and there's no extant memory in
CacheMemoryContext, so a new block is allocated.
Greetings,
Andres Freund
On 29 Jun 2023, at 00:31, Andres Freund <andres@anarazel.de> wrote:
On 2023-06-28 07:26:03 +0200, Ronan Dunklau wrote:
I see it as a way to have *some* sort of control over the malloc
implementation we use, instead of tuning our allocations pattern on top of it
while treating it entirely as a black box. As for the tuning, I proposed
earlier to replace this parameter expressed in terms of size as a "profile"
(greedy / conservative) to make it easier to pick a sensible value.
I don't think that makes it very usable - we'll still have idle connections
use up a lot more memory than now in some cases, and not in others, even
though it doesn't help. And it will be very heavily dependent on the OS and
glibc version.
Based on the comments in this thread, and since no update has been posted
addressing the objections, I will mark this as returned with feedback. Please feel
free to resubmit to a future CF.
--
Daniel Gustafsson