Assorted leaks and weirdness in parallel execution
I complained a couple weeks ago that nodeGatherMerge looked like it
leaked a lot of memory when commanded to rescan. Attached are three
proposed patches that, in combination, demonstrably result in zero
leakage across repeated rescans.
The first thing I noticed when I started digging into this was that
there was some leakage in TopMemoryContext, which seemed pretty weird.
What it turned out to be was on_dsm_detach callback registration records.
This happens because, although the comments for shm_mq_attach() claim
that shm_mq_detach() will free the shm_mq_handle, it does no such thing,
and it doesn't worry about canceling the on_dsm_detach registration
either. So over repeated attach/detach cycles, we leak shm_mq_handles
and also callback registrations. This isn't just a memory leak: it
means that, whenever we finally do detach from the DSM segment, we'll
execute a bunch of shm_mq_detach() calls pointed at long-since-detached-
and-reused shm_mq structs. That seems incredibly dangerous. It manages
not to fail at the moment because our stylized use of DSM means that a shm_mq
would only ever be re-used as another shm_mq; so the only real effect is
that our last counterparty process, if still attached, would receive N
SetLatch events not just one. But it's going to crash and burn someday.
For extra fun, the error MQs weren't ever explicitly detached from,
just left to rot until on_dsm_detach time. Although we did pfree the
shm_mq_handles out from under them.
So the first patch attached cleans this up by making shm_mq_detach
do what it was advertised to, ie fully reverse what shm_mq_attach
does. That means it needs to take a shm_mq_handle, not a bare shm_mq,
but that actually makes the callers cleaner anyway. (With this patch,
there are no callers of shm_mq_get_queue(); should we remove that?)
The second patch cleans up assorted garden-variety leaks when
rescanning a GatherMerge node, by having it allocate its work
arrays just once and then re-use them across rescans.
The last patch fixes the one remaining leak I saw after applying the
first two patches, namely that execParallel.c leaks the array palloc'd
by ExecParallelSetupTupleQueues --- just the array storage, not any of
the shm_mq_handles it points to. The given patch just adds a pfree
to ExecParallelFinish, but TBH I find this pretty unsatisfactory.
It seems like a significant modularity violation that execParallel.c
is responsible for creating those shm_mqs but not for cleaning them up.
That cleanup currently happens as a result of DestroyTupleQueueReader
calls done by nodeGather.c or nodeGatherMerge.c. I'm tempted to
propose that we should move both the creation and the destruction of
the TupleQueueReaders into execParallel.c; the current setup is not
just weird but requires duplicative coding in the Gather nodes.
(That would make it more difficult to do the early reader destruction
that nodeGather currently does, but I am not sure we care about that.)
Another thing that seems like a poor factorization choice is that
DestroyTupleQueueReader is charged with doing shm_mq_detach even though
tqueue.c did not do the shm_mq_attach ... should we rethink that?
Comments?
regards, tom lane
Attachments:
fix-shm-mq-management.patch (text/x-diff, +39-17)
fix-gathermerge-leaks.patch (text/x-diff, +78-33)
fix-pei-leaks.patch (text/x-diff, +6-1)
On Thu, Aug 31, 2017 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I complained a couple weeks ago that nodeGatherMerge looked like it
> leaked a lot of memory when commanded to rescan. Attached are three
> proposed patches that, in combination, demonstrably result in zero
> leakage across repeated rescans.
Gosh, thanks for looking into this so deeply. I apologize for all of
the bugs. Also, my ego is taking some severe damage here.
> But it's going to crash and burn someday.
Yeah, ouch.
> (With this patch,
> there are no callers of shm_mq_get_queue(); should we remove that?)
May as well. I can't remember any more why I did shm_mq_detach() that
way; I think there was someplace where I thought that the
shm_mq_handle might not be available. Maybe I'm misremembering, or
perhaps the situation has changed as that code has evolved.
> The last patch fixes the one remaining leak I saw after applying the
> first two patches, namely that execParallel.c leaks the array palloc'd
> by ExecParallelSetupTupleQueues --- just the array storage, not any of
> the shm_mq_handles it points to. The given patch just adds a pfree
> to ExecParallelFinish, but TBH I find this pretty unsatisfactory.
> It seems like a significant modularity violation that execParallel.c
> is responsible for creating those shm_mqs but not for cleaning them up.
Yeah, the correct division of labor between execParallel.c and
nodeGather.c was not entirely clear to me, and I don't pretend that I
got that 100% right.
> (That would make it more difficult to do the early reader destruction
> that nodeGather currently does, but I am not sure we care about that.)
I think the only thing that matters here is -- if we know that we're
not going to read any more tuples from a worker that might still be
generating tuples, it's imperative that we shut it down ASAP.
Otherwise, it's just going to keep cranking them out, wasting
resources unnecessarily. I think this is different than what you're
talking about here, but just wanted to be clear.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Aug 31, 2017 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> (With this patch,
>> there are no callers of shm_mq_get_queue(); should we remove that?)
> May as well. I can't remember any more why I did shm_mq_detach() that
> way; I think there was someplace where I thought that the
> shm_mq_handle might not be available. Maybe I'm misremembering, or
> perhaps the situation has changed as that code has evolved.
I initially tried to convert the on_dsm_detach callback to take a
pointer to the shm_mq_handle rather than the shm_mq proper. That
caused regression test crashes in some processes, indicating that
there are situations where we have freed the shm_mq_handle before
the DSM detach happens. I think it was only during worker process exit.
That's sort of contrary to the advice in shm_mq.c about the desirable
lifespan of a shm_mq_handle, but I didn't feel like trying to fix it.
It seems generally more robust if the on_dsm_detach callback assumes
as little as possible about intra-process state, anyway.
I don't have any strong reason to remove shm_mq_get_queue(), other than
neatnik-ism. It might save a caller having to remember the shm_mq pointer
separately. Given the set of API functions, that would only matter if
somebody wanted to set/get the sender/receiver PGPROC pointers later,
but maybe that's a plausible thing to do.
>> It seems like a significant modularity violation that execParallel.c
>> is responsible for creating those shm_mqs but not for cleaning them up.
> Yeah, the correct division of labor between execParallel.c and
> nodeGather.c was not entirely clear to me, and I don't pretend that I
> got that 100% right.
OK, I'll have a go at that.
>> (That would make it more difficult to do the early reader destruction
>> that nodeGather currently does, but I am not sure we care about that.)
> I think the only thing that matters here is -- if we know that we're
> not going to read any more tuples from a worker that might still be
> generating tuples, it's imperative that we shut it down ASAP.
> Otherwise, it's just going to keep cranking them out, wasting
> resources unnecessarily. I think this is different than what you're
> talking about here, but just wanted to be clear.
Yeah, it is different. What I'm looking at is that nodeGather does
DestroyTupleQueueReader as soon as it's seen EOF on a given tuple queue.
That can't save any worker cycles. The reason seems to be that it wants
to collapse its array of TupleQueueReader pointers so only live queues are
in it. That's reasonable, but I'm inclined to implement it by making the
Gather node keep a separate working array of pointers to only the live
TupleQueueReaders. The ParallelExecutorInfo would keep the authoritative
array of all TupleQueueReaders that have been created, and destroy them in
ExecParallelFinish.
Your point is that we want to shut down the TupleQueueReaders immediately
on rescan, which we do already. Another possible scenario is to shut them
down once we've reached the passed-down tuple limit (across the whole
Gather, not per-child which is what 3452dc524 implemented). I don't think
what I'm suggesting would complicate that.
regards, tom lane
On Thu, Aug 31, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, it is different. What I'm looking at is that nodeGather does
> DestroyTupleQueueReader as soon as it's seen EOF on a given tuple queue.
> That can't save any worker cycles. The reason seems to be that it wants
> to collapse its array of TupleQueueReader pointers so only live queues are
> in it. That's reasonable, but I'm inclined to implement it by making the
> Gather node keep a separate working array of pointers to only the live
> TupleQueueReaders. The ParallelExecutorInfo would keep the authoritative
> array of all TupleQueueReaders that have been created, and destroy them in
> ExecParallelFinish.
Hmm, that's a thought.
> Your point is that we want to shut down the TupleQueueReaders immediately
> on rescan, which we do already. Another possible scenario is to shut them
> down once we've reached the passed-down tuple limit (across the whole
> Gather, not per-child which is what 3452dc524 implemented). I don't think
> what I'm suggesting would complicate that.
Yeah. I think the way to do that would be to implement what is
mentioned in the comment for ExecShutdownNode: call that function on
the child plan as soon as the LIMIT is filled.
(Hmm, the reference to someday covering FDW in the header of that
comment is obsolete, isn't it? Another oversight on my part.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company