v12 "won't fix" item regarding memory leak in "ATTACH PARTITION without AEL"; (or, relcache ref counting)
This links to a long thread, from which I've tried to quote some of the
most important mails, below.
https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Won.27t_Fix
I wondered if there's an effort to pursue a resolution for v13?
On Fri, Apr 12, 2019 at 11:42:24AM -0400, Tom Lane wrote in <31027.1555083744@sss.pgh.pa.us>:
Michael Paquier <michael@paquier.xyz> writes:
On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote:
The problem lies in all branches that have partitioning, so it should be
listed under Older Bugs, right? You may have noticed that I posted
patches for all branches down to 10.

I have noticed. The message from Tom upthread outlined that an open
item should be added, but this is not one. That's what I wanted to
emphasize. Thanks for making it clearer.

To clarify a bit: there's more than one problem here. Amit added an
open item about pre-existing leaks related to rd_partcheck. (I'm going
to review and hopefully push his fix for that today.) However, there's
a completely separate leak associated with mismanagement of rd_pdcxt,
as I showed in [1], and it doesn't seem like we have consensus about
what to do about that one. AFAIK that's a new bug in 12 (caused by
898e5e329) and so it ought to be in the main open-items list.

This thread has discussed a bunch of possible future changes like
trying to replace copying of relcache-provided data structures
with reference-counting, but I don't think any such thing could be
v12 material at this point. We do need to fix the newly added
leak though.

regards, tom lane
On Fri, Mar 15, 2019 at 05:41:47PM -0400, Robert Haas wrote in <CA+Tgmoa5rT+ZR+Vv+q1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw@mail.gmail.com>:
On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
More to the point, we turned *one* rebuild = false situation into
a bunch of rebuild = true situations. I haven't studied it closely,
but I think a CCA animal would probably see O(N^2) rebuild = true
invocations in a query with N partitions, since each time
expand_partitioned_rtentry opens another child table, we'll incur
an AcceptInvalidationMessages call which leads to forced rebuilds
of the previously-pinned rels. In non-CCA situations, almost always
nothing happens with the previously-examined relcache entries.

That's rather unfortunate. I start to think that clobbering the cache
"always" is too hard a line.I agree that copying data isn't great. What I don't agree is that this
solution is better. In particular, copying data out of the relcache
rather than expecting the relcache to hold still over long periods
is the way we've done things everywhere else, cf RelationGetIndexList,
RelationGetStatExtList, RelationGetIndexExpressions,
RelationGetIndexPredicate, RelationGetIndexAttrBitmap,
RelationGetExclusionInfo, GetRelationPublicationActions. I don't care
for a patch randomly deciding to do things differently on the basis of an
unsupported-by-evidence argument that it might cost too much to copy the
data. If we're going to make a push to reduce the amount of copying of
that sort that we do, it should be a separately (and carefully) designed
thing that applies to all the relcache substructures that have the issue,
not one-off hacks that haven't been reviewed thoroughly.

That's not an unreasonable argument. On the other hand, if you never
try new stuff, you lose opportunities to explore things that might
turn out to be better and worth adopting more widely.

I am not very convinced that it makes sense to lump something like
RelationGetIndexAttrBitmap in with something like
RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a
single Bitmapset, whereas the data structure RelationGetPartitionDesc
is vastly larger and more complex. You can't say "well, if it's best
to copy 32 bytes of data out of the relcache every time we need it, it
must also be right to copy 10k or 100k of data out of the relcache
every time we need it."There is another difference as well: there's a good chance that
somebody is going to want to mutate a Bitmapset, whereas they had
BETTER NOT think that they can mutate the PartitionDesc. So returning
an uncopied Bitmapset is kinda risky in a way that returning an
uncopied PartitionDesc is not.

If we want an at-least-somewhat unified solution here, I think we need
to bite the bullet and make the planner hold a reference to the
relcache throughout planning. (The executor already keeps it open, I
believe.) Otherwise, how is the relcache supposed to know when it can
throw stuff away and when it can't? The only alternative seems to be
to have each subsystem hold its own reference count, as I did with the
PartitionDirectory stuff, which is not something we'd want to scale
out.

I especially don't care for the much-less-than-half-baked kluge of
chaining the old rd_pdcxt onto the new one and hoping that it will go away
at a suitable time.

It will go away at a suitable time, but maybe not at the soonest
suitable time. It wouldn't be hard to improve that, though. The
easiest thing to do, I think, would be to have an rd_oldpdcxt and
stuff the old context there. If there already is one there, parent
the new one under it. When RelationDecrementReferenceCount reduces
the reference count to zero, destroy anything found in rd_oldpdcxt.
With that change, things get destroyed at the earliest time at which
we know the old things aren't referenced, instead of the earliest time
at which they are not referenced + an invalidation arrives.

regression=# create table parent (a text, b int) partition by list (a);
CREATE TABLE
regression=# create table child (a text, b int);
CREATE TABLE
regression=# do $$
regression$# begin
regression$# for i in 1..10000000 loop
regression$# alter table parent attach partition child for values in ('x');
regression$# alter table parent detach partition child;
regression$# end loop;
regression$# end $$;

Interesting example.
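(Not part of the thread: for anyone reproducing this, the growth is
visible by attaching a debugger to the backend running the loop and
calling MemoryContextStats(TopMemoryContext), which dumps per-context
memory usage to stderr; the stacked partition-descriptor contexts
appear under CacheMemoryContext.)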
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
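For context, the copy-on-read convention Tom lists above
(RelationGetIndexList and friends) has roughly this shape; a simplified
sketch of the relcache.c code, not verbatim source:

List *
RelationGetIndexList(Relation relation)
{
    /* Quick exit if we already computed the list. */
    if (relation->rd_indexvalid)
        return list_copy(relation->rd_indexlist);

    /*
     * On a cache miss: scan pg_index, store the list in rd_indexlist in
     * the relcache's own long-lived context, then likewise return
     * list_copy(relation->rd_indexlist).  The caller never sees the
     * cached list itself, so the relcache entry is free to be rebuilt
     * or flushed at any time.
     */
}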
On Sun, Apr 14, 2019 at 03:29:26PM -0400, Tom Lane wrote in <27380.1555270166@sss.pgh.pa.us>:
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I agree that copying data isn't great. What I don't agree is that this
solution is better.

I am not very convinced that it makes sense to lump something like
RelationGetIndexAttrBitmap in with something like
RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a
single Bitmapset, whereas the data structure RelationGetPartitionDesc
is vastly larger and more complex. You can't say "well, if it's best
to copy 32 bytes of data out of the relcache every time we need it, it
must also be right to copy 10k or 100k of data out of the relcache
every time we need it."I did not say that. What I did say is that they're both correct
solutions. Copying large data structures is clearly a potential
performance problem, but that doesn't mean we can take correctness
shortcuts.

If we want an at-least-somewhat unified solution here, I think we need
to bite the bullet and make the planner hold a reference to the
relcache throughout planning. (The executor already keeps it open, I
believe.) Otherwise, how is the relcache supposed to know when it can
throw stuff away and when it can't?

The real problem with that is that we *still* won't know whether it's
okay to throw stuff away or not. The issue with these subsidiary
data structures is exactly that we're trying to reduce the lock strength
required for changing one of them, and as soon as you make that lock
strength less than AEL, you have the problem that that value may need
to change in relcache entries despite them being pinned. The code I'm
complaining about is trying to devise a shortcut solution to exactly
that situation ... and it's not a good shortcut.

The only alternative seems to be to have each subsystem hold its own
reference count, as I did with the PartitionDirectory stuff, which is
not something we'd want to scale out.

Well, we clearly don't want to devise a separate solution for every
subsystem, but it doesn't seem out of the question to me that we could
build a "reference counted cache sub-structure" mechanism and apply it
to each of these relcache fields. Maybe it could be unified with the
existing code in the typcache that does a similar thing. Sure, this'd
be a fair amount of work, but we've done it before. Syscache entries
didn't use to have reference counts, for example.

BTW, the problem I have with the PartitionDirectory stuff is exactly
that it *isn't* a reference-counted solution. If it were, we'd not
have this problem of not knowing when we could free rd_pdcxt.

I especially don't care for the much-less-than-half-baked kluge of
chaining the old rd_pdcxt onto the new one and hoping that it will go away
at a suitable time.

It will go away at a suitable time, but maybe not at the soonest
suitable time. It wouldn't be hard to improve that, though. The
easiest thing to do, I think, would be to have an rd_oldpdcxt and
stuff the old context there. If there already is one there, parent
the new one under it. When RelationDecrementReferenceCount reduces
the reference count to zero, destroy anything found in rd_oldpdcxt.

Meh. While it seems likely that that would mask most practical problems,
it still seems like covering up a wart with a dirty bandage. In
particular, that fix doesn't fix anything unless relcache reference counts
go to zero pretty quickly --- which I'll just note is directly contrary
to your enthusiasm for holding relcache pins longer.

I think that what we ought to do for v12 is have PartitionDirectory
copy the data, and then in v13 work on creating real reference-count
infrastructure that would allow eliminating the copy steps with full
safety. The $64 question is whether that really would cause unacceptable
performance problems. To look into that, I made the attached WIP patches.
(These are functionally complete, but I didn't bother for instance with
removing the hunk that 898e5e329 added to relcache.c, and the comments
need work, etc.) The first one just changes the PartitionDirectory
code to do that, and then the second one micro-optimizes
partition_bounds_copy() to make it somewhat less expensive, mostly by
collapsing lots of small palloc's into one big one.

What I get for test cases like [1] is
single-partition SELECT, hash partitioning:

    N     tps, HEAD      tps, patch
    2     11426.243754   11448.615193
    8     11254.833267   11374.278861
   32     11288.329114   11371.942425
  128     11222.329256   11185.845258
  512     11001.177137   10572.917288
 1024     10612.456470    9834.172965
 4096      8819.110195    7021.864625
 8192      7372.611355    5276.130161

single-partition SELECT, range partitioning:

    N     tps, HEAD      tps, patch
    2     11037.855338   11153.595860
    8     11085.218022   11019.132341
   32     10994.348207   10935.719951
  128     10884.417324   10532.685237
  512     10635.583411    9578.108915
 1024     10407.286414    8689.585136
 4096      8361.463829    5139.084405
 8192      7075.880701    3442.542768

Now certainly these numbers suggest that avoiding the copy could be worth
our trouble, but these results are still several orders of magnitude
better than where we were two weeks ago [2]. Plus, this is an extreme
case that's not really representative of real-world usage, since the test
tables have neither indexes nor any data. In practical situations the
baseline would be lower and the dropoff less bad. So I don't feel bad
about shipping v12 with these sorts of numbers and hoping for more
improvement later.

regards, tom lane
[1] /messages/by-id/3529.1554051867@sss.pgh.pa.us
[2] /messages/by-id/0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24
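The palloc-collapsing change Tom mentions amounts to the following idiom
(hypothetical variable names; this sketches the idea, not the actual
partition_bounds_copy() patch):

/* Before: one small palloc per datum array, ndatums allocations. */
for (i = 0; i < ndatums; i++)
    bound->datums[i] = (Datum *) palloc(sizeof(Datum) * natts);

/* After: one big palloc, carved up by pointer arithmetic.  Fewer
 * allocator round trips, and the whole thing frees as one chunk. */
Datum *pool = (Datum *) palloc(sizeof(Datum) * ndatums * natts);

for (i = 0; i < ndatums; i++)
    bound->datums[i] = pool + i * natts;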
On Wed, May 01, 2019 at 01:09:07PM -0400, Robert Haas wrote in <CA+Tgmob-cska+-WUC7T-G4zkSJp7yum6M_bzYd4YFzwQ51qiow@mail.gmail.com>:
On Wed, May 1, 2019 at 11:59 AM Andres Freund <andres@anarazel.de> wrote:
The message I'm replying to is marked as an open item. Robert, what do
you think needs to be done here before release? Could you summarize,
so we then can see what others think?

The original issue on this thread was that hyrax started running out
of memory when it hadn't been doing so previously. That happened
because, for complicated reasons, commit
898e5e3290a72d288923260143930fb32036c00c resulted in the leak being
hit lots of times in CLOBBER_CACHE_ALWAYS builds instead of just once.
Commits 2455ab48844c90419714e27eafd235a85de23232 and
d3f48dfae42f9655425d1f58f396e495c7fb7812 fixed that problem.

In the email at issue, Tom complains about two things. First, he
complains about the fact that I try to arrange things so that relcache
data remains valid for as long as required instead of just copying it.
Second, he complains about the fact that repeated ATTACH and DETACH
PARTITION operations can produce a slow session-lifespan memory leak.

I think it's reasonable to fix the second issue for v12. I am not
sure how important it is, because (1) the leak is small, (2) it seems
unlikely that anyone would execute enough ATTACH/DETACH PARTITION
operations in one backend for the leak to amount to anything
significant, and (3) if a relcache flush ever happens when you don't
have the relation open, all of the "leaked" memory will be un-leaked.
However, I believe that we could fix it as follows. First, invent
rd_pdoldcxt and put the first old context there; if that pointer is
already in use, then parent the new context under the old one.
Second, in RelationDecrementReferenceCount, if the refcount hits 0,
nuke rd_pdoldcxt and set the pointer back to NULL. With that change,
you would only keep the old PartitionDesc around until the ref count
hits 0, whereas at present, you keep the old PartitionDesc around
until an invalidation happens while the ref count is 0.

I think the first issue is not v12 material. Tom proposed to fix it
by copying all the relevant data out of the relcache, but his own
performance results show a pretty significant hit, and AFAICS he
hasn't pointed to anything that's actually broken by the current
approach. What I think should be done is actually generalize the
approach I took in this patch, so that instead of the partition
directory holding a reference count, the planner or executor holds
one. Then not only could people who want information about the
PartitionDesc avoid copying stuff from the relcache, but maybe other
things as well. I think that would be significantly better than
continuing to double-down on the current copy-everything approach,
which really only works well in a world where a table can't have all
that much metadata, which is clearly not true for PostgreSQL any more.
I'm not sure that Tom is actually opposed to this approach -- although
I may have misunderstood his position -- but where we disagree, I
think, is that I see what I did in this commit as a stepping-stone
toward a better world, and he sees it as something that should be
killed with fire until that better world has fully arrived.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
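A minimal sketch of the scheme Robert describes, assuming the proposed
rd_pdoldcxt field (illustrative code, not the committed patch):

/*
 * On rebuild: retire the old partition-descriptor context instead of
 * destroying it, since a pinned caller may still be using it.
 */
if (relation->rd_pdcxt != NULL)
{
    if (relation->rd_pdoldcxt == NULL)
        relation->rd_pdoldcxt = relation->rd_pdcxt;
    else
        MemoryContextSetParent(relation->rd_pdcxt,
                               relation->rd_pdoldcxt);
}
relation->rd_pdcxt = new_pdcxt;    /* hypothetical freshly built context */

/*
 * In RelationDecrementReferenceCount(): once the refcount hits zero,
 * nothing can still point into the retired contexts, so reap them
 * (MemoryContextDelete also deletes any reparented children).
 */
relation->rd_refcnt -= 1;
if (relation->rd_refcnt == 0 && relation->rd_pdoldcxt != NULL)
{
    MemoryContextDelete(relation->rd_pdoldcxt);
    relation->rd_pdoldcxt = NULL;
}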
On Tue, Jun 11, 2019 at 01:57:16PM -0400, Tom Lane wrote in <18286.1560275836@sss.pgh.pa.us>:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jun 6, 2019 at 2:48 AM Amit Langote <amitlangote09@gmail.com> wrote:
Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch,
which seems to fix this issue for me.

Yeah, that looks right. I think my patch was full of fuzzy thinking
and inadequate testing; thanks for checking it over and coming up with
the right solution.

Anyone else want to look/comment?

I think the existing code is horribly ugly and this is even worse.
It adds cycles to RelationDecrementReferenceCount which is a hotspot
that has no business dealing with this; the invariants are unclear;
and there's no strong reason to think there aren't still cases where
we accumulate lots of copies of old partition descriptors during a
sequence of operations. Basically you're just doubling down on a
wrong design.

As I said upthread, my current inclination is to do nothing in this
area for v12 and then try to replace the whole thing with proper
reference counting in v13. I think the cases where we have a major
leak are corner-case-ish enough that we can leave it as-is for one
release.

regards, tom lane
On Wed, Jun 12, 2019 at 03:11:56PM -0400, Tom Lane wrote in <3800.1560366716@sss.pgh.pa.us>:
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Jun 11, 2019 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think the change is responsive to your previous complaint that the
timing of stuff getting freed is not very well pinned down. With this
change, it's much more tightly pinned down: it happens when the
refcount goes to 0. That is definitely not perfect, but I think that
it is a lot easier to come up with scenarios where the leak
accumulates because no cache flush happens while the refcount is 0
than it is to come up with scenarios where the refcount never reaches
0. I agree that the latter type of scenario probably exists, but I
don't think we've come up with one yet.

I don't know why you think that's improbable, given that the changes
around PartitionDirectory-s cause relcache entries to be held open much
longer than before (something I've also objected to on this thread).As I said upthread, my current inclination is to do nothing in this
area for v12 and then try to replace the whole thing with proper
reference counting in v13. I think the cases where we have a major
leak are corner-case-ish enough that we can leave it as-is for one
release.

Is this something you're planning to work on yourself?

Well, I'd rather farm it out to somebody else, but ...

Do you have a
design in mind? Is the idea to reference-count the PartitionDesc?

What we discussed upthread was refcounting each of the various
large sub-objects of relcache entries, not just the partdesc.
I think if we're going to go this way we should bite the bullet and fix
them all. I really want to burn down RememberToFreeTupleDescAtEOX() in
particular ... it seems likely to me that that's also a source of
unpleasant memory leaks.

regards, tom lane
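The typcache precedent Tom points to is its reference-counted TupleDescs;
the usage discipline he is proposing to generalize to other relcache
sub-objects looks like this (real functions, illustrative usage):

TupleDesc tupdesc;

/*
 * lookup_rowtype_tupdesc() returns the cached descriptor with its
 * reference count incremented, so the typcache cannot free it out
 * from under us.
 */
tupdesc = lookup_rowtype_tupdesc(typid, typmod);

/* ... use tupdesc safely, even across cache invalidations ... */

/* Unpin; once the count falls to zero the cache may discard it. */
ReleaseTupleDesc(tupdesc);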
On Mon, Feb 24, 2020 at 7:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
This links to a long thread, from which I've tried to quote some of the
most important mails, below.
https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Won.27t_Fix

I wondered if there's an effort to pursue a resolution for v13?
I think commit 5b9312378 in the master branch fixes this. The commit
message describes it like this:
...
Also, fix things so that old copies of a relcache partition descriptor
will be dropped when the cache entry's refcount goes to zero. In the
previous coding it was possible for such copies to survive for the
lifetime of the session, as I'd complained of in a previous discussion.
(This management technique still isn't perfect, but it's better than
before.)
...
...Although this is a pre-existing
problem, no back-patch: the patch seems too invasive to be safe to
back-patch, and the bug it fixes is a corner case that seems
relatively unlikely to cause problems in the field.
Discussion:
/messages/by-id/CA+HiwqFUzjfj9HEsJtYWcr1SgQ_=iCAvQ=O2Sx6aQxoDu4OiHw@mail.gmail.com
Discussion:
/messages/by-id/CA+TgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw@mail.gmail.com
Thanks,
Amit