Memory leak in CachememoryContext
Please find below simple repro for CacheMemoryContext memory leak
create type two_int4s as (f1 int4, f2 int4);
create type two_int8s as (q1 int8, q2 int8);
PLpgSQL example:
do $$ declare c4 two_int4s; c8 two_int8s;
begin
c8 := row(1,2);
c4 := c8;
end$$;
Executing above plpgsql in same memory session we observe
cachememorycontext goes on increasing as below which is captured using
MemoryContextStats
1590:2023-04-19 13:31:54.336 IST [31615] LOG: Grand total: 1213440 bytes
in 153 blocks; 496000 free (53 chunks); 717440 used
1687:2023-04-19 13:31:54.348 IST [31615] LOG: Grand total: 1220608 bytes
in 160 blocks; 497160 free (53 chunks); 723448 used
1781:2023-04-19 13:31:59.919 IST [31615] LOG: Grand total: 1213440 bytes
in 154 blocks; 494168 free (45 chunks); 719272 used
1880:2023-04-19 13:31:59.924 IST [31615] LOG: Grand total: 1220608 bytes
in 161 blocks; 496128 free (45 chunks); 724480 used
1976:2023-04-19 13:32:29.977 IST [31615] LOG: Grand total: 1215488 bytes
in 156 blocks; 495144 free (45 chunks); 720344 used
2077:2023-04-19 13:32:29.978 IST [31615] LOG: Grand total: 1222656 bytes
in 163 blocks; 497104 free (45 chunks); 725552 used
Root cause:
Memory leak is in function "GetCachedExpression" which creates context
under CacheMemoryContext. During each execution in the same session memory
gets allocated and it is never freed resulting in memory leak.
During anonymous block execution in the function "plpgsql_estate_setup", a
local casting hash table gets created in SPI memory context. When hash
table look up is performed in "get_cast_hashenty" function if entry is no
present , memory is allocated in CacheMemoryContext in function
"GetCachedExpression".At the end of proc execution SPI memory context is
deleted and hence local hash table gets deleted, but still entries remain
in Cachemeorycontext.
During the next execution in the same session, a brand new hash table is
created and if entry is not present memory will be repeatedly assigned in
CacheMemoryContext.
Solution:
Please find attached(memoryleakfix.patch) to this email. We need to keep
track of the local casting hash table or session wide cast hash table which
gets created in the function "plpgsql_estate_setup". We need to allocate
CacheMemorycontext only for session wide cast hash table and for local cast
hash table memory will be allocated from SPI context.
Please find below CacheMemory Context stats with fix as below
3316:2023-04-19 14:07:23.391 IST [38021] LOG: Grand total: 1210368 bytes
in 151 blocks; 492704 free (45 chunks); 717664 used
3411:2023-04-19 14:07:23.391 IST [38021] LOG: Grand total: 1216512 bytes
in 157 blocks; 494176 free (45 chunks); 722336 used
3502:2023-04-19 14:07:23.932 IST [38021] LOG: Grand total: 1210368 bytes
in 151 blocks; 492704 free (45 chunks); 717664 used
3597:2023-04-19 14:07:23.932 IST [38021] LOG: Grand total: 1216512 bytes
in 157 blocks; 494176 free (45 chunks); 722336 used
3688:2023-04-19 14:07:24.464 IST [38021] LOG: Grand total: 1210368 bytes
in 151 blocks; 492704 free (45 chunks); 717664 used
3783:2023-04-19 14:07:24.464 IST [38021] LOG: Grand total: 1216512 bytes
in 157 blocks; 494176 free (45 chunks); 722336 used
3874:2023-04-19 14:07:25.012 IST [38021] LOG: Grand total: 1210368 bytes
in 151 blocks; 492704 free (45 chunks); 717664 used
3969:2023-04-19 14:07:25.012 IST [38021] LOG: Grand total: 1216512 bytes
in 157 blocks; 494176 free (45 chunks); 722336 used
4060:2023-04-19 14:07:25.552 IST [38021] LOG: Grand total: 1210368 bytes
in 151 blocks; 492704 free (45 chunks); 717664 used
4155:2023-04-19 14:07:25.552 IST [38021] LOG: Grand total: 1216512 bytes
in 157 blocks; 494176 free (45 chunks); 722336 used
Thanks & Best Regards,
Ajit
Attachments:
memoryleakfix.patchapplication/octet-stream; name=memoryleakfix.patchDownload+28-13
Ajit Awekar <ajit.awekar@enterprisedb.com> writes:
Please find below simple repro for CacheMemoryContext memory leak
Hm, yeah, reproduced here.
During anonymous block execution in the function "plpgsql_estate_setup", a
local casting hash table gets created in SPI memory context. When hash
table look up is performed in "get_cast_hashenty" function if entry is no
present , memory is allocated in CacheMemoryContext in function
"GetCachedExpression".At the end of proc execution SPI memory context is
deleted and hence local hash table gets deleted, but still entries remain
in Cachemeorycontext.
Yeah, it's from using just a short-lived cast hash table for DO blocks.
I think that was okay when it was written, but when we wheeled the
CachedExpression machinery undeneath it, we created a problem.
Please find attached(memoryleakfix.patch) to this email.
I don't think this fix is acceptable at all. A minor problem is that
we can't change the API of GetCachedExpression() in stable branches,
because extensions may be using it. We could work around that by
making it a wrapper function. But the big problem is that this patch
destroys the reason for using a CachedExpression in the first place:
because you aren't linking it into cached_expression_list, the plancache
will not detect events that should obsolete the expression. The
test cases added by 04fe805a1 only cover regular functions, but one
that did domain constraint DDL within a DO block would expose the
shortcoming.
A possible answer is to split plpgsql's cast hash table into two parts.
The lower-level part would contain the hash key, cast_expr and
cast_cexpr fields, and would have session lifespan and be used by
both DO blocks and regular functions. In this way we'd not leak
CachedExpressions. The upper-level hash table would contain the
hash key, a link to the relevant lower-level entry, and the
cast_exprstate, cast_in_use, cast_lxid fields. There would be a
session-lifespan one of these plus one for each DO block, so that
management of the ExprStates still works as it does now for DO blocks.
This could be factored in other ways, and maybe another way would be
simpler. But the idea is that DO blocks should use persistent
CachedExpressions even though their cast_exprstates are transient.
I've not tried to code this, do you want to?
regards, tom lane
Hi Tom,
Thanks a lot for your possible approach for a solution.
I have implemented the approach by splitting the hash table into two parts.
Please find the attached patch for the same.
Thanks & Best Regards,
Ajit
On Wed, Apr 19, 2023 at 10:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
Ajit Awekar <ajit.awekar@enterprisedb.com> writes:
Please find below simple repro for CacheMemoryContext memory leak
Hm, yeah, reproduced here.
During anonymous block execution in the function "plpgsql_estate_setup",
a
local casting hash table gets created in SPI memory context. When hash
table look up is performed in "get_cast_hashenty" function if entry is no
present , memory is allocated in CacheMemoryContext in function
"GetCachedExpression".At the end of proc execution SPI memory context is
deleted and hence local hash table gets deleted, but still entries remain
in Cachemeorycontext.Yeah, it's from using just a short-lived cast hash table for DO blocks.
I think that was okay when it was written, but when we wheeled the
CachedExpression machinery undeneath it, we created a problem.Please find attached(memoryleakfix.patch) to this email.
I don't think this fix is acceptable at all. A minor problem is that
we can't change the API of GetCachedExpression() in stable branches,
because extensions may be using it. We could work around that by
making it a wrapper function. But the big problem is that this patch
destroys the reason for using a CachedExpression in the first place:
because you aren't linking it into cached_expression_list, the plancache
will not detect events that should obsolete the expression. The
test cases added by 04fe805a1 only cover regular functions, but one
that did domain constraint DDL within a DO block would expose the
shortcoming.A possible answer is to split plpgsql's cast hash table into two parts.
The lower-level part would contain the hash key, cast_expr and
cast_cexpr fields, and would have session lifespan and be used by
both DO blocks and regular functions. In this way we'd not leak
CachedExpressions. The upper-level hash table would contain the
hash key, a link to the relevant lower-level entry, and the
cast_exprstate, cast_in_use, cast_lxid fields. There would be a
session-lifespan one of these plus one for each DO block, so that
management of the ExprStates still works as it does now for DO blocks.This could be factored in other ways, and maybe another way would be
simpler. But the idea is that DO blocks should use persistent
CachedExpressions even though their cast_exprstates are transient.I've not tried to code this, do you want to?
regards, tom lane
Attachments:
memoryleakfix_V1.patchapplication/octet-stream; name=memoryleakfix_V1.patchDownload+76-36
Ajit Awekar <ajit.awekar@enterprisedb.com> writes:
I have implemented the approach by splitting the hash table into two parts.
Please find the attached patch for the same.
I found a few things not to like about this:
* You didn't update the comment describing these hash tables.
* I wasn't really thrilled about renaming the plpgsql_CastHashEntry
typedef, as that seemed to just create uninteresting diff noise.
Also, "SessionCastHashEntry" versus "PrivateCastHashEntry" seems a
very misleading choice of names, since one of the "PrivateCastHashEntry"
hash tables is in fact session-lifespan. After some thought I left
the "upper" hash table entry type as plpgsql_CastHashEntry so that
code outside the immediate area needn't be affected, and named the
"lower" table cast_expr_hash, with entry type plpgsql_CastExprHashEntry.
I'm not wedded to those names though, if you have a better idea.
(BTW, it's completely reasonable to rename the type as an intermediate
step in making a patch like this, since it ensures you'll examine
every existing usage to choose the right thing to change it to. But
I generally rename things back afterwards.)
* I didn't like having to do two hashtable lookups during every
call even after we've fully cached the info. That's easy to avoid
by keeping a link to the associated "lower" hashtable entry in the
"upper" ones.
* You removed the reset of cast_exprstate etc from the code path where
we've just reconstructed the cast_expr. That's a mistake since it
might allow us to skip rebuilding the derived expression state after
a DDL change.
Also, while looking at this I noticed that we are no longer making
any use of estate->cast_hash_context. That's not the fault of
your patch; it's another oversight in the one that added the
CachedExpression mechanism. The compiled expressions used to be
stored in that context, but now the plancache is responsible for
them and we are never putting anything in the cast_hash_context.
So we might as well get rid of that and save 8K of wasted memory.
This allows some simplification in the hashtable setup code too.
In short, I think we need something more like the attached.
(Note to self: we can't remove the cast_hash_context field in
back branches for fear of causing an ABI break for pldebugger.
But we can leave it unused, I think.)
regards, tom lane
Attachments:
memoryleakfix_V2.patchtext/x-diff; charset=us-ascii; name=memoryleakfix_V2.patchDownload+66-30
Tom, Thanks a lot for your patch. I applied the changes and confirmed
there is no memory leak with the V2 patch.
We are not using MemoryContext variables "cast_hash_context" and
"shared_cast_context".
Thanks & Best Regards,
Ajit
On Sat, Apr 22, 2023 at 4:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
Ajit Awekar <ajit.awekar@enterprisedb.com> writes:
I have implemented the approach by splitting the hash table into two
parts.
Please find the attached patch for the same.
I found a few things not to like about this:
* You didn't update the comment describing these hash tables.
* I wasn't really thrilled about renaming the plpgsql_CastHashEntry
typedef, as that seemed to just create uninteresting diff noise.
Also, "SessionCastHashEntry" versus "PrivateCastHashEntry" seems a
very misleading choice of names, since one of the "PrivateCastHashEntry"
hash tables is in fact session-lifespan. After some thought I left
the "upper" hash table entry type as plpgsql_CastHashEntry so that
code outside the immediate area needn't be affected, and named the
"lower" table cast_expr_hash, with entry type plpgsql_CastExprHashEntry.
I'm not wedded to those names though, if you have a better idea.(BTW, it's completely reasonable to rename the type as an intermediate
step in making a patch like this, since it ensures you'll examine
every existing usage to choose the right thing to change it to. But
I generally rename things back afterwards.)* I didn't like having to do two hashtable lookups during every
call even after we've fully cached the info. That's easy to avoid
by keeping a link to the associated "lower" hashtable entry in the
"upper" ones.* You removed the reset of cast_exprstate etc from the code path where
we've just reconstructed the cast_expr. That's a mistake since it
might allow us to skip rebuilding the derived expression state after
a DDL change.Also, while looking at this I noticed that we are no longer making
any use of estate->cast_hash_context. That's not the fault of
your patch; it's another oversight in the one that added the
CachedExpression mechanism. The compiled expressions used to be
stored in that context, but now the plancache is responsible for
them and we are never putting anything in the cast_hash_context.
So we might as well get rid of that and save 8K of wasted memory.
This allows some simplification in the hashtable setup code too.In short, I think we need something more like the attached.
(Note to self: we can't remove the cast_hash_context field in
back branches for fear of causing an ABI break for pldebugger.
But we can leave it unused, I think.)regards, tom lane
On 2023-Apr-21, Tom Lane wrote:
(Note to self: we can't remove the cast_hash_context field in
back branches for fear of causing an ABI break for pldebugger.
But we can leave it unused, I think.)
Hmm, we can leave it unused in our code, but it still needs to be
initialized to some valid memory context anyway; otherwise hypothetical
code that uses it would still crash. This seems halfway obvious, but
since the submitted patch doesn't have this part, I thought better to
point it out.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
(Note to self: we can't remove the cast_hash_context field in
back branches for fear of causing an ABI break for pldebugger.
But we can leave it unused, I think.)
Hmm, we can leave it unused in our code, but it still needs to be
initialized to some valid memory context anyway; otherwise hypothetical
code that uses it would still crash.
I think we want that to happen, actually, because it's impossible
to guess what such hypothetical code needs the context to be.
As things stand now, that field points to a long-lived context
in some cases and a short-lived one in others. We risk either
data structure corruption or a session-lifespan memory leak
if we guess about such usage ... which really shouldn't exist
anyway.
regards, tom lane
I wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Hmm, we can leave it unused in our code, but it still needs to be
initialized to some valid memory context anyway; otherwise hypothetical
code that uses it would still crash.
I think we want that to happen, actually, because it's impossible
to guess what such hypothetical code needs the context to be.
I guess we could have the back branches continue to create a
shared_cast_context and just not use it in core. Seems rather
expensive for a very hypothetical compatibility measure, though.
regards, tom lane
On 2023-Apr-24, Tom Lane wrote:
I wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Hmm, we can leave it unused in our code, but it still needs to be
initialized to some valid memory context anyway; otherwise hypothetical
code that uses it would still crash.I think we want that to happen, actually, because it's impossible
to guess what such hypothetical code needs the context to be.I guess we could have the back branches continue to create a
shared_cast_context and just not use it in core. Seems rather
expensive for a very hypothetical compatibility measure, though.
I think a session-long memory leak is not so bad, compared to a possible
crash. However, after looking at the code again, as well as pldebugger
and plpgsql_check, I agree that there's no point in doing anything other
than keeping the field there.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Hay dos momentos en la vida de un hombre en los que no debería
especular: cuando puede permitírselo y cuando no puede" (Mark Twain)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
On 2023-Apr-24, Tom Lane wrote:
I guess we could have the back branches continue to create a
shared_cast_context and just not use it in core. Seems rather
expensive for a very hypothetical compatibility measure, though.
I think a session-long memory leak is not so bad, compared to a possible
crash. However, after looking at the code again, as well as pldebugger
and plpgsql_check, I agree that there's no point in doing anything other
than keeping the field there.
Yeah, I can't see any plausible reason for outside code to be using
that field (and I don't see any evidence in Debian Code Search that
anyone is). I'll push it like this then. Thanks for looking!
regards, tom lane