memory leak in trigger handling (since PG12)
Hi,
it seems there's a fairly annoying memory leak in trigger code,
introduced by
commit fc22b6623b6b3bab3cb057ccd282c2bfad1a0b30
Author: Peter Eisentraut <peter@eisentraut.org>
Date: Sat Mar 30 08:13:09 2019 +0100
Generated columns
...
which added GetAllUpdatedColumns() and uses it in many places instead of
the original GetUpdatedColumns():
#define GetUpdatedColumns(relinfo, estate) \
	(exec_rt_fetch((relinfo)->ri_RangeTableIndex, estate)->updatedCols)

#define GetAllUpdatedColumns(relinfo, estate) \
	(bms_union(exec_rt_fetch((relinfo)->ri_RangeTableIndex, estate)->updatedCols, \
			   exec_rt_fetch((relinfo)->ri_RangeTableIndex, estate)->extraUpdatedCols))
Notice this creates a new bitmap on every call. That's a bit of a
problem, because we call this from:
- ExecASUpdateTriggers
- ExecARUpdateTriggers
- ExecBSUpdateTriggers
- ExecBRUpdateTriggers
- ExecUpdateLockMode
This means that for an UPDATE with triggers, we may end up calling this
for each row, possibly creating multiple bitmaps per row. And those bitmaps
are allocated in ExecutorState, so they won't be freed until the end of the
query :-(
The bitmaps are typically fairly small (a couple of bytes), though for wider
tables they can be a couple dozen bytes. But the total leak is primarily driven
by the number of updated rows.
It's easy to leak gigabytes when updating ~10M rows. I've even seen cases
with a couple tens of GBs leaked, but there it seems to have been caused by
an UPDATE ... FROM missing a join condition (so in fact the memory leak is
proportional to the number of rows in the join result, not the number we
end up updating).
Attached is a patch, restoring the pre-12 behavior for me.
While looking for other places allocating stuff in ExecutorState (for
the UPDATE case) and leaving it there, I found two more cases:
1) copy_plpgsql_datums
2) make_expanded_record_from_tupdesc
make_expanded_record_from_exprecord
All of these are called from plpgsql_exec_trigger.
Fixing the copy_plpgsql_datums case seems fairly simple, the space
allocated for local copies can be freed during the cleanup. That's what
0002 does.
I'm not sure what to do about the expanded records. My understanding of
the expanded record lifecycle is fairly limited, so my (rather naive)
attempt to free the memory failed ...
I wonder how much we should care about these cases. On the one hand we
often leave the cleanup up to the memory context, but that assumes the
context is not unnecessarily long-lived, and ExecutorState doesn't meet
that assumption. And leaking memory per row does not seem great either.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Free-updatedCols-bitmaps.patch (text/x-patch)
From 85439e2587f035040c82e7303b96b887e650b01f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Tue, 23 May 2023 17:45:47 +0200
Subject: [PATCH 1/2] Free updatedCols bitmaps
---
src/backend/commands/trigger.c | 22 ++++++++++++++++++++--
src/backend/executor/execMain.c | 9 +++++++--
2 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 4b295f8da5e..fb11203c473 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2916,6 +2916,8 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
(errcode(ERRCODE_E_R_I_E_TRIGGER_PROTOCOL_VIOLATED),
errmsg("BEFORE STATEMENT trigger cannot return a value")));
}
+
+ bms_free(updatedCols);
}
void
@@ -2928,12 +2930,18 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
Assert(relinfo->ri_RootResultRelInfo == NULL);
if (trigdesc && trigdesc->trig_update_after_statement)
+ {
+ Bitmapset *updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+
AfterTriggerSaveEvent(estate, relinfo, NULL, NULL,
TRIGGER_EVENT_UPDATE,
false, NULL, NULL, NIL,
- ExecGetAllUpdatedCols(relinfo, estate),
+ updatedCols,
transition_capture,
false);
+
+ bms_free(updatedCols);
+ }
}
bool
@@ -3043,6 +3051,9 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
heap_freetuple(trigtuple);
if (should_free_new)
heap_freetuple(oldtuple);
+
+ bms_free(updatedCols);
+
return false; /* "do nothing" */
}
else if (newtuple != oldtuple)
@@ -3068,6 +3079,8 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
if (should_free_trig)
heap_freetuple(trigtuple);
+ bms_free(updatedCols);
+
return true;
}
@@ -3107,6 +3120,7 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
*/
TupleTableSlot *oldslot;
ResultRelInfo *tupsrc;
+ Bitmapset *updatedCols;
Assert((src_partinfo != NULL && dst_partinfo != NULL) ||
!is_crosspart_update);
@@ -3129,14 +3143,18 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
else
ExecClearTuple(oldslot);
+ updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+
AfterTriggerSaveEvent(estate, relinfo,
src_partinfo, dst_partinfo,
TRIGGER_EVENT_UPDATE,
true,
oldslot, newslot, recheckIndexes,
- ExecGetAllUpdatedCols(relinfo, estate),
+ updatedCols,
transition_capture,
is_crosspart_update);
+
+ bms_free(updatedCols);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c76fdf59ec4..98502b956c2 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2367,6 +2367,7 @@ ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo)
{
Bitmapset *keyCols;
Bitmapset *updatedCols;
+ LockTupleMode lockMode;
/*
* Compute lock mode to use. If columns that are part of the key have not
@@ -2378,9 +2379,13 @@ ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo)
INDEX_ATTR_BITMAP_KEY);
if (bms_overlap(keyCols, updatedCols))
- return LockTupleExclusive;
+ lockMode = LockTupleExclusive;
+ else
+ lockMode = LockTupleNoKeyExclusive;
+
+ bms_free(updatedCols);
- return LockTupleNoKeyExclusive;
+ return lockMode;
}
/*
--
2.40.1
0002-Free-space-allocated-by-copy_plpgsql_datums.patch (text/x-patch)
From 16a7075569fa56951971592b237929f49e414737 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sat, 13 May 2023 13:19:49 +0200
Subject: [PATCH 2/2] Free space allocated by copy_plpgsql_datums
During PL/pgSQL execution, we allocate a local copy of the function's
datums for each call. When executing triggers, this gets allocated in
ExecutorState, so it'd be kept until the query completes.
Free it explicitly, as part of cleanup.
---
src/pl/plpgsql/src/pl_exec.c | 39 +++++++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 5 deletions(-)
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 4b76f7699a6..99283d354f4 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -264,6 +264,8 @@ static void coerce_function_result_tuple(PLpgSQL_execstate *estate,
static void plpgsql_exec_error_callback(void *arg);
static void copy_plpgsql_datums(PLpgSQL_execstate *estate,
PLpgSQL_function *func);
+static void free_plpgsql_datums(PLpgSQL_execstate *estate,
+ PLpgSQL_function *func);
static void plpgsql_fulfill_promise(PLpgSQL_execstate *estate,
PLpgSQL_var *var);
static MemoryContext get_stmt_mcontext(PLpgSQL_execstate *estate);
@@ -788,6 +790,8 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
exec_eval_cleanup(&estate);
/* stmt_mcontext will be destroyed when function's main context is */
+ free_plpgsql_datums(&estate, func);
+
/*
* Pop the error context stack
*/
@@ -1142,6 +1146,8 @@ plpgsql_exec_trigger(PLpgSQL_function *func,
exec_eval_cleanup(&estate);
/* stmt_mcontext will be destroyed when function's main context is */
+ free_plpgsql_datums(&estate, func);
+
/*
* Pop the error context stack
*/
@@ -1217,6 +1223,8 @@ plpgsql_exec_event_trigger(PLpgSQL_function *func, EventTriggerData *trigdata)
exec_eval_cleanup(&estate);
/* stmt_mcontext will be destroyed when function's main context is */
+ free_plpgsql_datums(&estate, func);
+
/*
* Pop the error context stack
*/
@@ -1304,15 +1312,20 @@ copy_plpgsql_datums(PLpgSQL_execstate *estate,
char *ws_next;
int i;
- /* Allocate local datum-pointer array */
- estate->datums = (PLpgSQL_datum **)
- palloc(sizeof(PLpgSQL_datum *) * ndatums);
-
/*
+ * Allocate local datum-pointer array, followed by space for the values.
+ *
* To reduce palloc overhead, we make a single palloc request for all the
* space needed for locally-instantiated datums.
*/
- workspace = palloc(func->copiable_size);
+ estate->datums = (PLpgSQL_datum **)
+ palloc(MAXALIGN(sizeof(PLpgSQL_datum *) * ndatums) + func->copiable_size);
+
+ /*
+ * Space for values starts after the datum-pointer array, but we need to make
+ * sure it's aligned properly.
+ */
+ workspace = (char *) estate->datums + MAXALIGN(sizeof(PLpgSQL_datum *) * ndatums);
ws_next = workspace;
/* Fill datum-pointer array, copying datums into workspace as needed */
@@ -1362,6 +1375,22 @@ copy_plpgsql_datums(PLpgSQL_execstate *estate,
Assert(ws_next == workspace + func->copiable_size);
}
+/*
+ * Free the local execution variables. We've allocated all of the space at once
+ * so we can simply free the main pointer.
+ */
+static void
+free_plpgsql_datums(PLpgSQL_execstate *estate,
+ PLpgSQL_function *func)
+{
+ if (estate->datums != NULL)
+ {
+ pfree(estate->datums);
+ estate->datums = NULL;
+ }
+
+}
+
/*
* If the variable has an armed "promise", compute the promised value
* and assign it to the variable.
--
2.40.1
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
it seems there's a fairly annoying memory leak in trigger code,
introduced by
...
Attached is a patch, restoring the pre-12 behavior for me.
While looking for other places allocating stuff in ExecutorState (for
the UPDATE case) and leaving it there, I found two more cases:
1) copy_plpgsql_datums
2) make_expanded_record_from_tupdesc
make_expanded_record_from_exprecord
All of this is calls from plpgsql_exec_trigger.
Not sure about the expanded-record case, but both of your other two
fixes feel like poor substitutes for pushing the memory into a
shorter-lived context. In particular I'm quite surprised that
plpgsql isn't already allocating that workspace in the "procedure"
memory context.
I wonder how much we should care about these cases. On the one hand we
often leave the cleanup up to the memory context, but the assumption is
the context is not unnecessarily long-lived. And ExecutorState is not
that. And leaking memory per-row does not seem great either.
I agree per-row leaks in the ExecutorState context are not cool.
regards, tom lane
Hi,
On 2023-05-23 18:23:00 +0200, Tomas Vondra wrote:
This means that for an UPDATE with triggers, we may end up calling this
for each row, possibly multiple bitmaps. And those bitmaps are allocated
in ExecutorState, so won't be freed until the end of the query :-(
Ugh.
I've wondered about some form of instrumentation to detect such issues
before. It's obviously a problem that we can have fairly large leaks, like the
one you just discovered, without detecting it for a couple years. It's kinda
made worse by the memory context infrastructure, because it hides such issues.
Could it help to have a mode where the executor shutdown hook checks how much
memory is allocated in ExecutorState and warns if it's too much? There's IIRC a
few places that allocate large things directly in it, but most of those
probably should be dedicated contexts anyway. Something similar could be
useful for some other long-lived contexts.
The bitmaps are typically fairly small (a couple bytes), but for wider
tables it can be a couple dozen bytes. But it's primarily driven by
number of updated rows.
Random aside: I've been wondering whether it'd be worth introducing an
in-place representation of Bitmap (e.g. if the low bit is set, the low 63 bits
are in-place, if unset, it's a pointer).
Attached is a patch, restoring the pre-12 behavior for me.
Hm. Somehow this doesn't seem quite right. Shouldn't we try to use a shorter
lived memory context instead? Otherwise we'll just end up with the same
problem in a few years.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I've wondered about some form of instrumentation to detect such issues
before.
Yeah.
Could it help to have a mode where the executor shutdown hook checks how much
memory is allocated in ExecutorState and warns if it's too much?
It'd be very hard to set a limit for what's "too much", since the amount
of stuff created initially will depend on the plan size. In any case
I think that the important issue is not the absolute amount of space, but
whether there is per-row leakage. I wonder if we could do something involving
checking for continued growth after the first retrieved tuple, or
something like that.
Random aside: I've been wondering whether it'd be worth introducing an
in-place representation of Bitmap (e.g. if the low bit is set, the low 63 bits
are in-place, if unset, it's a pointer).
Why? Unlike Lists, those things are already a single palloc chunk.
regards, tom lane
Hi,
On 2023-05-23 13:28:30 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Could it help to have a mode where the executor shutdown hook checks how much
memory is allocated in ExecutorState and warns if it's too much?

It'd be very hard to set a limit for what's "too much", since the amount
of stuff created initially will depend on the plan size.
I was thinking of some limit that should really never be reached outside of a
leak or work_mem based allocations, say 2GB or so.
In any case I think that the important issue is not how much absolute space,
but is there per-row leakage. I wonder if we could do something involving
checking for continued growth after the first retrieved tuple, or something
like that.
As-is, some nodes do large allocations in the query context, but are not
guaranteed to be reached when gathering the first row. So we would still have
to move such large allocations out of ExecutorState.
I think it might be best to go for a combination of these two
heuristics. Store the size of es_query_cxt after standard_ExecutorStart();
that would include the allocation of the executor tree itself. Then in
standard_ExecutorEnd(), if the difference in size of ExecutorState is bigger
than some constant *and* is bigger than the initial size by some factor, emit
a warning.
The constant size difference avoids spurious warnings in case of a small plan
that just grows due to a few fmgr lookups or such, while the factor takes care
of the plan complexity?
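In code, roughly what I mean is something like this (just a sketch - the
threshold values and the es_startup_mem field are made up, piggy-backing on
the existing mem_allocated accounting via MemoryContextMemAllocated()):

#define LEAK_WARN_MIN_BYTES	(64 * 1024 * 1024)	/* absolute growth threshold */
#define LEAK_WARN_FACTOR	8					/* relative growth factor */

/* in standard_ExecutorStart(), once the plan state tree has been built */
estate->es_startup_mem =
	MemoryContextMemAllocated(estate->es_query_cxt, true);

/* in standard_ExecutorEnd(), before tearing down the EState */
{
	Size	total = MemoryContextMemAllocated(estate->es_query_cxt, true);

	if (total > estate->es_startup_mem + LEAK_WARN_MIN_BYTES &&
		total > estate->es_startup_mem * LEAK_WARN_FACTOR)
		elog(WARNING, "ExecutorState grew from %zu to %zu bytes",
			 (size_t) estate->es_startup_mem, (size_t) total);
}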
Random aside: I've been wondering whether it'd be worth introducing an
in-place representation of Bitmap (e.g. if the low bit is set, the low 63 bits
are in-place, if unset, it's a pointer).

Why? Unlike Lists, those things are already a single palloc chunk.
We do a fair amount of 8 byte allocations - they have quite a bit of overhead,
even after c6e0fe1f2a0. Not needing allocations for the common case of
bitmapsets with a max member < 63 seems like it could be worth it.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2023-05-23 13:28:30 -0400, Tom Lane wrote:
Why? Unlike Lists, those things are already a single palloc chunk.
We do a fair amount of 8 byte allocations - they have quite a bit of overhead,
even after c6e0fe1f2a0. Not needing allocations for the common case of
bitmapsets with a max member < 63 seems like it could be worth it.
Oh, now I understand what you meant: use the pointer's bits as data.
Dunno that it's a good idea though. You'd pay for the palloc savings
by needing two or four code paths in every bitmapset function, because
the need to reserve one bit would mean you couldn't readily make the
two cases look alike at the bit-pushing level.
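To illustrate (only a sketch of the idea, with invented names, not a proposal):

/*
 * Sketch of the "pointer bits as data" representation.  Low bit set
 * means members 0..62 are stored inline in the word itself; low bit
 * clear means the value is an ordinary Bitmapset pointer.
 */
typedef uintptr_t BitmapWord;		/* invented name */

#define BITMAPWORD_INLINE	((uintptr_t) 1)

static inline bool
bitmapword_is_member(BitmapWord w, int x)
{
	if (w & BITMAPWORD_INLINE)
		return (x >= 0 && x < 63) && ((w >> (x + 1)) & 1) != 0;

	/* pointer case: fall back to the existing Bitmapset code */
	return bms_is_member(x, (Bitmapset *) w);
}

and every other bitmapset function needs the same sort of branch, which is
exactly the extra code-path cost.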
Another big problem is that we'd have to return to treating bitmapsets
as a special-purpose thing rather than a kind of Node. While that's
not very deeply embedded yet, I recall that the alternatives weren't
attractive.
Also, returning to the original topic: we'd never find leaks of the
sort complained of here, because they wouldn't exist in cases with
fewer than 64 relations per query (or whatever the bitmap is
representing).
regards, tom lane
On 5/23/23 18:39, Tom Lane wrote:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
it seems there's a fairly annoying memory leak in trigger code,
introduced by
...
Attached is a patch, restoring the pre-12 behavior for me.

While looking for other places allocating stuff in ExecutorState (for
the UPDATE case) and leaving it there, I found two more cases:

1) copy_plpgsql_datums

2) make_expanded_record_from_tupdesc
make_expanded_record_from_exprecord

All of these are called from plpgsql_exec_trigger.
Not sure about the expanded-record case, but both of your other two
fixes feel like poor substitutes for pushing the memory into a
shorter-lived context. In particular I'm quite surprised that
plpgsql isn't already allocating that workspace in the "procedure"
memory context.
I don't disagree, but which memory context should this use and
when/where should we switch to it?
I haven't seen any obvious memory context candidate in the code calling
ExecGetAllUpdatedCols, so I guess we'd have to pass it from above. Is
that a good idea for backbranches ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/23/23 19:14, Andres Freund wrote:
Hi,
On 2023-05-23 18:23:00 +0200, Tomas Vondra wrote:
This means that for an UPDATE with triggers, we may end up calling this
for each row, possibly creating multiple bitmaps per row. And those bitmaps
are allocated in ExecutorState, so they won't be freed until the end of the
query :-(

Ugh.
I've wondered about some form of instrumentation to detect such issues
before. It's obviously a problem that we can have fairly large leaks, like the
one you just discovered, without detecting it for a couple years. It's kinda
made worse by the memory context infrastructure, because it hides such issues.

Could it help to have a mode where the executor shutdown hook checks how much
memory is allocated in ExecutorState and warns if it's too much? There's IIRC a
few places that allocate large things directly in it, but most of those
probably should be dedicated contexts anyway. Something similar could be
useful for some other long-lived contexts.
Not sure such simple instrumentation would help much, unfortunately :-(
We only discovered this because the user reported OOM crashes, which
means the executor didn't get to the shutdown hook at all. Yeah, maybe
if we had such instrumentation it'd get triggered for milder cases, but
I'd bet the amount of noise is going to be significant.
For example, there's a nearby thread about hashjoin allocating buffiles
etc. in ExecutorState - we ended up moving that to a separate context,
but surely there are more such cases (not just for ExecutorState).
The really hard thing was determining what causes the memory leak - the
simple instrumentation doesn't help with that at all. It tells you there
might be a leak, but you don't know where the allocations came from.
What I ended up doing is a simple gdb script that sets breakpoints on
all palloc/pfree variants, and prints info (including the backtrace) for
each call on ExecutorState. And then a script that aggregates those to
identify which backtraces allocated most chunks that were not freed.
This was super slow (definitely useless outside a development environment),
but it made it super obvious where the root cause is.
I experimented with instrumenting the code a bit (as an alternative to gdb
breakpoints) - still slow, but much faster than gdb. But perhaps useful
for a special (valgrind-like) testing mode ...
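Roughly, I mean something along these lines (only a sketch, not the actual
instrumentation I used, with the interesting context hard-coded and the
helper called from the palloc/pfree paths):

static void
trace_context_alloc(MemoryContext context, void *chunk, Size size, bool is_free)
{
	/* only trace the context we suspect is leaking */
	if (strcmp(context->name, "ExecutorState") != 0)
		return;

	fprintf(stderr, "%s %p %zu\n",
			is_free ? "pfree" : "palloc", chunk, (size_t) size);
}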
Would require testing with more data, though. I doubt we'd find much
with our regression tests, which have tiny data sets.
...
Attached is a patch, restoring the pre-12 behavior for me.
Hm. Somehow this doesn't seem quite right. Shouldn't we try to use a shorter
lived memory context instead? Otherwise we'll just end up with the same
problem in a few years.
I agree using a shorter lived memory context would be more elegant, and
more in line with how we do things. But it's not clear to me why we'd
end up with the same problem in a few years with what the patch does.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
The really hard thing was determining what causes the memory leak - the
simple instrumentation doesn't help with that at all. It tells you there
might be a leak, but you don't know where the allocations came from.
What I ended up doing is a simple gdb script that sets breakpoints on
all palloc/pfree variants, and prints info (including the backtrace) for
each call on ExecutorState. And then a script that aggregates those to
identify which backtraces allocated most chunks that were not freed.
FWIW, I've had some success localizing palloc memory leaks with valgrind's
leak detection mode. The trick is to ask it for a report before the
context gets destroyed. Beats writing your own infrastructure ...
Would require testing with more data, though. I doubt we'd find much
with our regression tests, which have tiny data sets.
Yeah, it's not clear whether we could make the still-hypothetical check
sensitive enough to find leaks using small test cases without getting an
unworkable number of false positives. Still, might be worth trying.
It might be an acceptable tradeoff to have stricter rules for what can
be allocated in ExecutorState in order to make this sort of problem
more easily detectable.
regards, tom lane
Hi, just two cents:
On Tue, May 23, 2023 at 8:01 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-05-23 13:28:30 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Could it help to have a mode where the executor shutdown hook checks how much
memory is allocated in ExecutorState and warns if it's too much?

It'd be very hard to set a limit for what's "too much", since the amount
of stuff created initially will depend on the plan size.

I was thinking of some limit that should really never be reached outside of a
leak or work_mem based allocations, say 2GB or so.
RE: instrumentation subthread:
if that helps, then the technique below can work reasonably well on normal
binaries for end users (given debug symbols are installed), so maybe we
don't need that much added infrastructure to see the hot code path:
perf probe -x /path/to/postgres 'palloc' 'size=%di:u64'    # RDI on x86_64 (palloc size arg0)
perf record -avg --call-graph dwarf -e probe_postgres:palloc -aR -p <pid> sleep 3    # cannot be longer, huge overhead (~3s=~2GB)
it produces:
50.27% (563d0e380670) size=24
|
---palloc
bms_copy
ExecUpdateLockMode
ExecBRUpdateTriggers
ExecUpdate
[..]
49.73% (563d0e380670) size=16
|
---palloc
bms_copy
RelationGetIndexAttrBitmap
ExecUpdateLockMode
ExecBRUpdateTriggers
ExecUpdate
[..]
Now we know that those small palloc() calls are the guilty ones, but we
didn't know that at the time when working on this with Tomas. The problem
here is that palloc() does not know, via its own arguments, which
MemoryContext the allocation is going to land in. This is problematic for
perf, because on RHEL8 I was not able to generate an uprobe capable of
reaching a global variable (CurrentMemoryContext) at that time.
Additionally, what was even more frustrating when diagnosing that case on
the customer's system was that such OOMs were crashing other
PostgreSQL clusters on the same OS. Even knowing the exact guilty
statement, it was impossible to limit the RSS memory usage of that
particular backend. So, what you are proposing makes a lot of sense.
Also, it got me thinking of implementing a safety-net GUC,
debug_query_limit_backend_memory=X MB, that would inject
setrlimit(RLIMIT_DATA) through an external extension via hook(s) and
un-set it later. The man page states it applies to mmap() only since
Linux 4.7+, so it is future-proof but won't work e.g. on RHEL7 -
maybe that's still good enough? Or, well, maybe try to hack palloc()
a little, but that probably has too big an overhead, right? (just
thinking out loud).
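Roughly, the hook part would boil down to something like this (just a
sketch; error handling and the GUC plumbing omitted, function names invented):

#include <sys/resource.h>

static struct rlimit saved_limit;

/* sketch: clamp the backend's data segment before running the query */
static void
set_backend_memory_limit(int limit_mb)
{
	struct rlimit rl;

	getrlimit(RLIMIT_DATA, &saved_limit);
	rl = saved_limit;
	rl.rlim_cur = (rlim_t) limit_mb * 1024 * 1024;	/* soft limit only */
	setrlimit(RLIMIT_DATA, &rl);
}

/* and restore the previous limit afterwards */
static void
reset_backend_memory_limit(void)
{
	setrlimit(RLIMIT_DATA, &saved_limit);
}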
-Jakub Wartak.
On 5/23/23 23:39, Tom Lane wrote:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
The really hard thing was determining what causes the memory leak - the
simple instrumentation doesn't help with that at all. It tells you there
might be a leak, but you don't know where the allocations came from.

What I ended up doing is a simple gdb script that sets breakpoints on
all palloc/pfree variants, and prints info (including the backtrace) for
each call on ExecutorState. And then a script that aggregates those to
identify which backtraces allocated most chunks that were not freed.

FWIW, I've had some success localizing palloc memory leaks with valgrind's
leak detection mode. The trick is to ask it for a report before the
context gets destroyed. Beats writing your own infrastructure ...
I haven't tried valgrind, so can't compare.
Would it be possible to filter which memory contexts to track? Say we
know the leak is in ExecutorState, so we don't need to track allocations
in other contexts. That was a huge speedup for me, maybe it'd help
valgrind too.
Also, how did you ask for the report before context gets destroyed?
Would require testing with more data, though. I doubt we'd find much
with our regression tests, which have tiny data sets.

Yeah, it's not clear whether we could make the still-hypothetical check
sensitive enough to find leaks using small test cases without getting an
unworkable number of false positives. Still, might be worth trying.
I'm not against experimenting with that. Were you thinking about
something that'd be cheap enough to just be enabled always/everywhere,
or something we'd enable during testing?
This reminded me of a Strange Loop talk [1][2] about the Scalene memory
profiler from UMass. That's for Python, but they did some smart tricks
to reduce the cost of profiling - maybe we could do something similar,
possibly by extending the memory contexts a bit.
[1]: https://youtu.be/vVUnCXKuNOg?t=1405
[2]: https://youtu.be/vVUnCXKuNOg?t=1706
It might be an acceptable tradeoff to have stricter rules for what can
be allocated in ExecutorState in order to make this sort of problem
more easily detectable.
Would these rules be just customary, or defined/enforced in the code
somehow? I can't quite imagine how that would work, TBH.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/24/23 10:19, Jakub Wartak wrote:
Hi, just two cents:
On Tue, May 23, 2023 at 8:01 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-05-23 13:28:30 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Could it help to have a mode where the executor shutdown hook checks how much
memory is allocated in ExecutorState and warns if it's too much?

It'd be very hard to set a limit for what's "too much", since the amount
of stuff created initially will depend on the plan size.

I was thinking of some limit that should really never be reached outside of a
leak or work_mem based allocations, say 2GB or so.

RE: instrumentation subthread:
if that helps, then the technique below can work reasonably well on normal
binaries for end users (given debug symbols are installed), so maybe we
don't need that much added infrastructure to see the hot code path:

perf probe -x /path/to/postgres 'palloc' 'size=%di:u64'    # RDI on x86_64 (palloc size arg0)
perf record -avg --call-graph dwarf -e probe_postgres:palloc -aR -p <pid> sleep 3    # cannot be longer, huge overhead (~3s=~2GB)

it produces:

50.27% (563d0e380670) size=24
|
---palloc
bms_copy
ExecUpdateLockMode
ExecBRUpdateTriggers
ExecUpdate
[..]

49.73% (563d0e380670) size=16
|
---palloc
bms_copy
RelationGetIndexAttrBitmap
ExecUpdateLockMode
ExecBRUpdateTriggers
ExecUpdate
[..]

Now we know that those small palloc() calls are the guilty ones, but we
didn't know that at the time when working on this with Tomas. The problem
here is that palloc() does not know, via its own arguments, which
MemoryContext the allocation is going to land in. This is problematic for
perf, because on RHEL8 I was not able to generate an uprobe capable of
reaching a global variable (CurrentMemoryContext) at that time.
I think there are a couple even bigger issues:
(a) There are other methods that allocate memory - repalloc, palloc0,
MemoryContextAlloc, ... and so on. But presumably we can trace all of
them at once? We'd have to trace reset/deletes too.
(b) This doesn't say if/when the allocated chunks get freed - even with
a fix, we'll still do exactly the same number of allocations, except
that we'll free the memory shortly after that. I wonder if we could
print a bit more info for each probe, to match the alloc/free requests.
Additionally, what was even more frustrating when diagnosing that case on
the customer's system was that such OOMs were crashing other
PostgreSQL clusters on the same OS. Even knowing the exact guilty
statement, it was impossible to limit the RSS memory usage of that
particular backend. So, what you are proposing makes a lot of sense.

Also, it got me thinking of implementing a safety-net GUC,
debug_query_limit_backend_memory=X MB, that would inject
setrlimit(RLIMIT_DATA) through an external extension via hook(s) and
un-set it later. The man page states it applies to mmap() only since
Linux 4.7+, so it is future-proof but won't work e.g. on RHEL7 -
maybe that's still good enough? Or, well, maybe try to hack palloc()
a little, but that probably has too big an overhead, right? (just
thinking out loud).
Not sure about setting a hard limit - that seems pretty tricky and may
easily backfire. It's already possible to set such a memory limit using
e.g. cgroups, or even better to use VMs to isolate the instances.
I certainly agree it's annoying that when OOM hits, we end up with no
information about what used the memory. Maybe we could have a threshold
triggering a call to MemoryContextStats? So that we have at least some
memory usage info in the log in case the OOM killer intervenes.
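I mean something as trivial as this (a sketch, with an arbitrary threshold,
and it would obviously need some rate-limiting):

/* if total memory usage exceeds 1GB, dump per-context stats to the log */
if (MemoryContextMemAllocated(TopMemoryContext, true) > (Size) 1024 * 1024 * 1024)
	MemoryContextStats(TopMemoryContext);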
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/23/23 22:57, Tomas Vondra wrote:
On 5/23/23 18:39, Tom Lane wrote:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
it seems there's a fairly annoying memory leak in trigger code,
introduced by
...
Attached is a patch, restoring the pre-12 behavior for me.

While looking for other places allocating stuff in ExecutorState (for
the UPDATE case) and leaving it there, I found two more cases:

1) copy_plpgsql_datums

2) make_expanded_record_from_tupdesc
make_expanded_record_from_exprecord

All of these are called from plpgsql_exec_trigger.

Not sure about the expanded-record case, but both of your other two
fixes feel like poor substitutes for pushing the memory into a
shorter-lived context. In particular I'm quite surprised that
plpgsql isn't already allocating that workspace in the "procedure"
memory context.

I don't disagree, but which memory context should this use and
when/where should we switch to it?

I haven't seen any obvious memory context candidate in the code
calling ExecGetAllUpdatedCols, so I guess we'd have to pass it from
above. Is that a good idea for backbranches ...
I looked at this again, and I think GetPerTupleMemoryContext(estate)
might do the trick, see the 0002 part. Unfortunately it's not much
smaller/simpler than just freeing the chunks, because we end up doing
oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
MemoryContextSwitchTo(oldcxt);
and then have to pass updatedCols elsewhere. It's tricky to just switch
to the context (e.g. in ExecASUpdateTriggers/ExecARUpdateTriggers), as
AfterTriggerSaveEvent allocates other bits of memory too (in a longer
lived context). So we'd have to do another switch again. Not sure how
backpatch-friendly that would be.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Free-updatedCols-bitmaps.patch (text/x-patch)
From 85439e2587f035040c82e7303b96b887e650b01f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Tue, 23 May 2023 17:45:47 +0200
Subject: [PATCH 1/3] Free updatedCols bitmaps
---
src/backend/commands/trigger.c | 22 ++++++++++++++++++++--
src/backend/executor/execMain.c | 9 +++++++--
2 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 4b295f8da5e..fb11203c473 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2916,6 +2916,8 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
(errcode(ERRCODE_E_R_I_E_TRIGGER_PROTOCOL_VIOLATED),
errmsg("BEFORE STATEMENT trigger cannot return a value")));
}
+
+ bms_free(updatedCols);
}
void
@@ -2928,12 +2930,18 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
Assert(relinfo->ri_RootResultRelInfo == NULL);
if (trigdesc && trigdesc->trig_update_after_statement)
+ {
+ Bitmapset *updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+
AfterTriggerSaveEvent(estate, relinfo, NULL, NULL,
TRIGGER_EVENT_UPDATE,
false, NULL, NULL, NIL,
- ExecGetAllUpdatedCols(relinfo, estate),
+ updatedCols,
transition_capture,
false);
+
+ bms_free(updatedCols);
+ }
}
bool
@@ -3043,6 +3051,9 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
heap_freetuple(trigtuple);
if (should_free_new)
heap_freetuple(oldtuple);
+
+ bms_free(updatedCols);
+
return false; /* "do nothing" */
}
else if (newtuple != oldtuple)
@@ -3068,6 +3079,8 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
if (should_free_trig)
heap_freetuple(trigtuple);
+ bms_free(updatedCols);
+
return true;
}
@@ -3107,6 +3120,7 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
*/
TupleTableSlot *oldslot;
ResultRelInfo *tupsrc;
+ Bitmapset *updatedCols;
Assert((src_partinfo != NULL && dst_partinfo != NULL) ||
!is_crosspart_update);
@@ -3129,14 +3143,18 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
else
ExecClearTuple(oldslot);
+ updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+
AfterTriggerSaveEvent(estate, relinfo,
src_partinfo, dst_partinfo,
TRIGGER_EVENT_UPDATE,
true,
oldslot, newslot, recheckIndexes,
- ExecGetAllUpdatedCols(relinfo, estate),
+ updatedCols,
transition_capture,
is_crosspart_update);
+
+ bms_free(updatedCols);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c76fdf59ec4..98502b956c2 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2367,6 +2367,7 @@ ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo)
{
Bitmapset *keyCols;
Bitmapset *updatedCols;
+ LockTupleMode lockMode;
/*
* Compute lock mode to use. If columns that are part of the key have not
@@ -2378,9 +2379,13 @@ ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo)
INDEX_ATTR_BITMAP_KEY);
if (bms_overlap(keyCols, updatedCols))
- return LockTupleExclusive;
+ lockMode = LockTupleExclusive;
+ else
+ lockMode = LockTupleNoKeyExclusive;
+
+ bms_free(updatedCols);
- return LockTupleNoKeyExclusive;
+ return lockMode;
}
/*
--
2.40.1
0002-Switch-to-short-lived-MemoryContext.patch (text/x-patch)
From 99246f4de6668d7a3f6054f672e67e66a1982edf Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Wed, 24 May 2023 13:12:03 +0200
Subject: [PATCH 2/3] Switch to short-lived MemoryContext
---
src/backend/commands/trigger.c | 39 ++++++++++++++++++++++-----------
src/backend/executor/execMain.c | 14 ++++++------
2 files changed, 33 insertions(+), 20 deletions(-)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index fb11203c473..f5872ca6da2 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2867,6 +2867,7 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
int i;
TriggerData LocTriggerData = {0};
Bitmapset *updatedCols;
+ MemoryContext oldcxt;
trigdesc = relinfo->ri_TrigDesc;
@@ -2883,8 +2884,12 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
/* statement-level triggers operate on the parent table */
Assert(relinfo->ri_RootResultRelInfo == NULL);
+ oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+
updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+ MemoryContextSwitchTo(oldcxt);
+
LocTriggerData.type = T_TriggerData;
LocTriggerData.tg_event = TRIGGER_EVENT_UPDATE |
TRIGGER_EVENT_BEFORE;
@@ -2916,8 +2921,6 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
(errcode(ERRCODE_E_R_I_E_TRIGGER_PROTOCOL_VIOLATED),
errmsg("BEFORE STATEMENT trigger cannot return a value")));
}
-
- bms_free(updatedCols);
}
void
@@ -2931,7 +2934,14 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
if (trigdesc && trigdesc->trig_update_after_statement)
{
- Bitmapset *updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+ MemoryContext oldcxt;
+ Bitmapset *updatedCols;
+
+ oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+
+ updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+
+ MemoryContextSwitchTo(oldcxt);
AfterTriggerSaveEvent(estate, relinfo, NULL, NULL,
TRIGGER_EVENT_UPDATE,
@@ -2939,8 +2949,6 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
updatedCols,
transition_capture,
false);
-
- bms_free(updatedCols);
}
}
@@ -2962,6 +2970,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
TriggerData LocTriggerData = {0};
int i;
Bitmapset *updatedCols;
+ MemoryContext oldcxt;
LockTupleMode lockmode;
/* Determine lock mode to use */
@@ -3015,7 +3024,13 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
TRIGGER_EVENT_ROW |
TRIGGER_EVENT_BEFORE;
LocTriggerData.tg_relation = relinfo->ri_RelationDesc;
+
+ oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+
updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+
+ MemoryContextSwitchTo(oldcxt);
+
LocTriggerData.tg_updatedcols = updatedCols;
for (i = 0; i < trigdesc->numtriggers; i++)
{
@@ -3051,9 +3066,6 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
heap_freetuple(trigtuple);
if (should_free_new)
heap_freetuple(oldtuple);
-
- bms_free(updatedCols);
-
return false; /* "do nothing" */
}
else if (newtuple != oldtuple)
@@ -3079,8 +3091,6 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
if (should_free_trig)
heap_freetuple(trigtuple);
- bms_free(updatedCols);
-
return true;
}
@@ -3120,7 +3130,8 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
*/
TupleTableSlot *oldslot;
ResultRelInfo *tupsrc;
- Bitmapset *updatedCols;
+ MemoryContext oldcxt;
+ Bitmapset *updatedCols;
Assert((src_partinfo != NULL && dst_partinfo != NULL) ||
!is_crosspart_update);
@@ -3143,8 +3154,12 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
else
ExecClearTuple(oldslot);
+ oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+
updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+ MemoryContextSwitchTo(oldcxt);
+
AfterTriggerSaveEvent(estate, relinfo,
src_partinfo, dst_partinfo,
TRIGGER_EVENT_UPDATE,
@@ -3153,8 +3168,6 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
updatedCols,
transition_capture,
is_crosspart_update);
-
- bms_free(updatedCols);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 98502b956c2..71e93486c42 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2367,7 +2367,9 @@ ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo)
{
Bitmapset *keyCols;
Bitmapset *updatedCols;
- LockTupleMode lockMode;
+ MemoryContext oldcxt;
+
+ oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
/*
* Compute lock mode to use. If columns that are part of the key have not
@@ -2378,14 +2380,12 @@ ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo)
keyCols = RelationGetIndexAttrBitmap(relinfo->ri_RelationDesc,
INDEX_ATTR_BITMAP_KEY);
- if (bms_overlap(keyCols, updatedCols))
- lockMode = LockTupleExclusive;
- else
- lockMode = LockTupleNoKeyExclusive;
+ MemoryContextSwitchTo(oldcxt);
- bms_free(updatedCols);
+ if (bms_overlap(keyCols, updatedCols))
+ return LockTupleExclusive;
- return lockMode;
+ return LockTupleNoKeyExclusive;
}
/*
--
2.40.1
0003-Free-space-allocated-by-copy_plpgsql_datums.patch (text/x-patch)
From 4375046f8b1f376e404b6ed65a358ee6d55d3b41 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sat, 13 May 2023 13:19:49 +0200
Subject: [PATCH 3/3] Free space allocated by copy_plpgsql_datums
During PL/pgSQL execution, we allocate a local copy of the function's
datums for each call. When executing triggers, this gets allocated in
ExecutorState, so it'd be kept until the query completes.
Free it explicitly, as part of cleanup.
---
src/pl/plpgsql/src/pl_exec.c | 39 +++++++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 5 deletions(-)
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 4b76f7699a6..99283d354f4 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -264,6 +264,8 @@ static void coerce_function_result_tuple(PLpgSQL_execstate *estate,
static void plpgsql_exec_error_callback(void *arg);
static void copy_plpgsql_datums(PLpgSQL_execstate *estate,
PLpgSQL_function *func);
+static void free_plpgsql_datums(PLpgSQL_execstate *estate,
+ PLpgSQL_function *func);
static void plpgsql_fulfill_promise(PLpgSQL_execstate *estate,
PLpgSQL_var *var);
static MemoryContext get_stmt_mcontext(PLpgSQL_execstate *estate);
@@ -788,6 +790,8 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
exec_eval_cleanup(&estate);
/* stmt_mcontext will be destroyed when function's main context is */
+ free_plpgsql_datums(&estate, func);
+
/*
* Pop the error context stack
*/
@@ -1142,6 +1146,8 @@ plpgsql_exec_trigger(PLpgSQL_function *func,
exec_eval_cleanup(&estate);
/* stmt_mcontext will be destroyed when function's main context is */
+ free_plpgsql_datums(&estate, func);
+
/*
* Pop the error context stack
*/
@@ -1217,6 +1223,8 @@ plpgsql_exec_event_trigger(PLpgSQL_function *func, EventTriggerData *trigdata)
exec_eval_cleanup(&estate);
/* stmt_mcontext will be destroyed when function's main context is */
+ free_plpgsql_datums(&estate, func);
+
/*
* Pop the error context stack
*/
@@ -1304,15 +1312,20 @@ copy_plpgsql_datums(PLpgSQL_execstate *estate,
char *ws_next;
int i;
- /* Allocate local datum-pointer array */
- estate->datums = (PLpgSQL_datum **)
- palloc(sizeof(PLpgSQL_datum *) * ndatums);
-
/*
+ * Allocate local datum-pointer array, followed by space for the values.
+ *
* To reduce palloc overhead, we make a single palloc request for all the
* space needed for locally-instantiated datums.
*/
- workspace = palloc(func->copiable_size);
+ estate->datums = (PLpgSQL_datum **)
+ palloc(MAXALIGN(sizeof(PLpgSQL_datum *) * ndatums) + func->copiable_size);
+
+ /*
+ * Space for values starts after the datum-pointer array, but we need to make
+ * sure it's aligned properly.
+ */
+ workspace = (char *) estate->datums + MAXALIGN(sizeof(PLpgSQL_datum *) * ndatums);
ws_next = workspace;
/* Fill datum-pointer array, copying datums into workspace as needed */
@@ -1362,6 +1375,22 @@ copy_plpgsql_datums(PLpgSQL_execstate *estate,
Assert(ws_next == workspace + func->copiable_size);
}
+/*
+ * Free the local execution variables. We've allocated all of the space at once
+ * so we can simply free the main pointer.
+ */
+static void
+free_plpgsql_datums(PLpgSQL_execstate *estate,
+ PLpgSQL_function *func)
+{
+ if (estate->datums != NULL)
+ {
+ pfree(estate->datums);
+ estate->datums = NULL;
+ }
+
+}
+
/*
* If the variable has an armed "promise", compute the promised value
* and assign it to the variable.
--
2.40.1
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
While looking for other places allocating stuff in ExecutorState (for
the UPDATE case) and leaving it there, I found two more cases:
1) copy_plpgsql_datums
2) make_expanded_record_from_tupdesc
make_expanded_record_from_exprecord
All of these are called from plpgsql_exec_trigger.
Can you show a test case in which this happens? I added some
instrumentation and verified at least within our regression tests,
copy_plpgsql_datums' CurrentMemoryContext is always plpgsql's
"SPI Proc" context, so I do not see how there can be a query-lifespan
leak there, nor how your 0003 would fix it if there is.
regards, tom lane
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
On 5/23/23 23:39, Tom Lane wrote:
FWIW, I've had some success localizing palloc memory leaks with valgrind's
leak detection mode. The trick is to ask it for a report before the
context gets destroyed. Beats writing your own infrastructure ...
I haven't tried valgrind, so can't compare.
Would it be possible to filter which memory contexts to track? Say we
know the leak is in ExecutorState, so we don't need to track allocations
in other contexts. That was a huge speedup for me, maybe it'd help
valgrind too.
I don't think valgrind has a way to do that, but this'd be something you
set up specially in any case.
Also, how did you ask for the report before context gets destroyed?
There are several valgrind client requests that could be helpful:
/* Do a full memory leak check (like --leak-check=full) mid-execution. */
#define VALGRIND_DO_LEAK_CHECK \
VALGRIND_DO_CLIENT_REQUEST_STMT(VG_USERREQ__DO_LEAK_CHECK, \
0, 0, 0, 0, 0)
/* Same as VALGRIND_DO_LEAK_CHECK but only showing the entries for
which there was an increase in leaked bytes or leaked nr of blocks
since the previous leak search. */
#define VALGRIND_DO_ADDED_LEAK_CHECK \
VALGRIND_DO_CLIENT_REQUEST_STMT(VG_USERREQ__DO_LEAK_CHECK, \
0, 1, 0, 0, 0)
/* Return number of leaked, dubious, reachable and suppressed bytes found by
all previous leak checks. They must be lvalues. */
#define VALGRIND_COUNT_LEAK_BLOCKS(leaked, dubious, reachable, suppressed) \
Putting VALGRIND_DO_ADDED_LEAK_CHECK someplace in the executor loop would
help narrow things down pretty quickly, assuming you had a self-contained
example demonstrating the leak. I don't recall exactly how I used these
but it was something along that line.
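For instance, something like this dropped into ExecutePlan's per-tuple loop
(just a sketch, compiled only under USE_VALGRIND):

#ifdef USE_VALGRIND
	/* incremental leak report: only growth since the previous check is shown */
	VALGRIND_DO_ADDED_LEAK_CHECK;
#endif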
Yeah, it's not clear whether we could make the still-hypothetical check
sensitive enough to find leaks using small test cases without getting an
unworkable number of false positives. Still, might be worth trying.
I'm not against experimenting with that. Were you thinking about
something that'd be cheap enough to just be enabled always/everywhere,
or something we'd enable during testing?
We seem to have already paid the overhead of counting all palloc
allocations, so I don't think it'd be too expensive to occasionally check
the ExecutorState's mem_allocated and see if it's growing (especially
if we only do so in assert-enabled builds). The trick is to define the
rules for what's worth reporting.
It might be an acceptable tradeoff to have stricter rules for what can
be allocated in ExecutorState in order to make this sort of problem
more easily detectable.
Would these rules be just customary, or defined/enforced in the code
somehow? I can't quite imagine how that would work, TBH.
If the check bleated "WARNING: possible executor memory leak" during
regression testing, people would soon become conditioned to doing
whatever they have to do to avoid it ;-)
regards, tom lane
On 5/24/23 17:37, Tom Lane wrote:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
While looking for other places allocating stuff in ExecutorState (for
the UPDATE case) and leaving it there, I found two more cases:

1) copy_plpgsql_datums

2) make_expanded_record_from_tupdesc
make_expanded_record_from_exprecord

All of these are called from plpgsql_exec_trigger.
Can you show a test case in which this happens? I added some
instrumentation and verified at least within our regression tests,
copy_plpgsql_datums' CurrentMemoryContext is always plpgsql's
"SPI Proc" context, so I do not see how there can be a query-lifespan
leak there, nor how your 0003 would fix it if there is.
Interesting. I tried to reproduce it, but without success, and it passes
even with an assert on the context name. The only explanation I have is
that the gdb script I used might have been a bit broken - it used
conditional breakpoints like this one:
break AllocSetAlloc if strcmp(((MemoryContext) $rdi)->name, \
"ExecutorState") == 0
commands
bt
cont
end
but I just noticed gdb sometimes complains about this:
Error in testing breakpoint condition:
'__strcmp_avx2' has unknown return type; cast the call to its declared
return type
The breakpoint still fires all the commands, which is pretty surprising
behavior, but that might explain why I saw copy_plpgsql_datums as another
culprit. And I suspect the make_expanded_record calls might be caused by
the same thing.
I'll check deeper tomorrow, when I get access to the original script
etc. We can ignore these cases until then.
Sorry for the confusion :-/
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2023-05-23 23:26:42 +0200, Tomas Vondra wrote:
On 5/23/23 19:14, Andres Freund wrote:
Hi,
On 2023-05-23 18:23:00 +0200, Tomas Vondra wrote:
This means that for an UPDATE with triggers, we may end up calling this
for each row, possibly creating multiple bitmaps per row. And those bitmaps
are allocated in ExecutorState, so they won't be freed until the end of the
query :-(

Ugh.

I've wondered about some form of instrumentation to detect such issues
before. It's obviously a problem that we can have fairly large leaks, like the
one you just discovered, without detecting it for a couple years. It's kinda
made worse by the memory context infrastructure, because it hides such issues.

Could it help to have a mode where the executor shutdown hook checks how much
memory is allocated in ExecutorState and warns if it's too much? There's IIRC a
few places that allocate large things directly in it, but most of those
probably should be dedicated contexts anyway. Something similar could be
useful for some other long-lived contexts.

Not sure such simple instrumentation would help much, unfortunately :-(

We only discovered this because the user reported OOM crashes, which
means the executor didn't get to the shutdown hook at all. Yeah, maybe
if we had such instrumentation it'd get triggered for milder cases, but
I'd bet the amount of noise is going to be significant.

For example, there's a nearby thread about hashjoin allocating buffiles
etc. in ExecutorState - we ended up moving that to a separate context,
but surely there are more such cases (not just for ExecutorState).

Yes, that's why I said that we would have to move more of those into dedicated
contexts - which is a good idea independent of this issue.

The really hard thing was determining what causes the memory leak - the
simple instrumentation doesn't help with that at all. It tells you there
might be a leak, but you don't know where the allocations came from.

What I ended up doing is a simple gdb script that sets breakpoints on
all palloc/pfree variants, and prints info (including the backtrace) for
each call on ExecutorState. And then a script that aggregates those to
identify which backtraces allocated most chunks that were not freed.
FWIW, for things like this I found "heaptrack" to be extremely helpful.
E.g. for a reproducer of the problem here, it gave me the attached "flame graph"
of the peak memory usage - after attaching to a running backend and running an
UPDATE triggering the leak.
Because the view I screenshotted shows the stacks contributing to peak memory
usage, it works nicely to find "temporary leaks", even if memory is actually
freed after all etc.
Hm. Somehow this doesn't seem quite right. Shouldn't we try to use a shorter
lived memory context instead? Otherwise we'll just end up with the same
problem in a few years.

I agree using a shorter lived memory context would be more elegant, and
more in line with how we do things. But it's not clear to me why we'd
end up with the same problem in a few years with what the patch does.
Because it sets up the pattern of manual memory management and continues to
run the relevant code within a query-lifetime context.
Greetings,
Andres Freund
Attachments:
2023-05-24T11:05:24,383491003-07:00.png (image/png) - heaptrack flame graph of peak memory usage