Re: Proposal for fixing intra-query memory leaks

Started by Bruce Momjian almost 26 years ago · 2 messages · hackers

bruce@momjian.us

almost 26 years ago

FYI, Tom, is this still relevant?

This issue seems to have been on the back burner for a while,
but I think we need to put it on the front burner again for 7.1.
Here is a think-piece I just did. I'd appreciate comments,
particularly about possible interactions with TOAST --- Jan,
did you have any particular plan in mind for freeing datums created
by de-TOASTing?

regards, tom lane

Proposal for memory allocation fixes 29-Apr-2000
------------------------------------

We know that Postgres has serious problems with memory leakage during
large queries that process a lot of pass-by-reference data. There is
no provision for recycling memory until end of query. This needs to be
fixed, even more so with the advent of TOAST which will allow very
large chunks of data to be passed around in the system. Furthermore,
7.1 is an ideal time for fixing it since TOAST and the function-manager
interface changes will require visiting a lot of the same code that needs
to be cleaned up. So, here is a proposal.

Background
----------

We already do most of our memory allocation in "memory contexts", which
are usually AllocSets as implemented by backend/utils/mmgr/aset.c.
(Is there any value in allowing for other memory context types? We could
save some cycles by getting rid of a level of indirection here.) What
we need to do is create more contexts and define proper rules about when
they can be freed.

The basic operations on a memory context are:

* create a context

* delete a context (including freeing all the memory allocated therein)

* reset a context (free all memory allocated in the context, but not the
context object itself)

Given a context, one can allocate a chunk of memory within it, free a
previously allocated chunk, or realloc a previously allocated chunk larger
or smaller. (These operations correspond directly to standard C's
malloc(), free(), and realloc() routines.) At all times there is a
"current" context denoted by the CurrentMemoryContext global variable.
The backend macros palloc(), pfree(), and repalloc() implicitly operate on
that context. The MemoryContextSwitchTo() operation selects a new
current context (and returns the previous context, so that the caller can
restore the previous context before exiting).
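
To make the operations above concrete, here is a minimal sketch in C. It is
not the aset.c implementation: every name in it (ToyContext, toy_create,
toy_alloc, toy_switch_to, and so on) is hypothetical, and each chunk is
simply malloc'd and linked into its context, which is enough to show
create/reset/delete and the save-and-restore idiom for the current context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical chunk header; the payload follows it directly.
 * Payload alignment is whatever the header size gives, which is
 * enough for a sketch but not for a real allocator. */
typedef struct ToyChunk {
    struct ToyChunk *next;
    size_t size;
} ToyChunk;

typedef struct ToyContext {
    ToyChunk *chunks;   /* every chunk allocated in this context */
    size_t nchunks;
} ToyContext;

/* Plays the role of CurrentMemoryContext. */
ToyContext *toy_current = NULL;

ToyContext *toy_create(void)
{
    ToyContext *cxt = malloc(sizeof(ToyContext));
    assert(cxt != NULL);
    cxt->chunks = NULL;
    cxt->nchunks = 0;
    return cxt;
}

/* Free all memory in the context, but keep the context object. */
void toy_reset(ToyContext *cxt)
{
    while (cxt->chunks != NULL) {
        ToyChunk *next = cxt->chunks->next;
        free(cxt->chunks);
        cxt->chunks = next;
    }
    cxt->nchunks = 0;
}

/* Free all memory in the context and the context object itself. */
void toy_delete(ToyContext *cxt)
{
    toy_reset(cxt);
    free(cxt);
}

/* Select a new current context; return the previous one so the
 * caller can restore it before exiting. */
ToyContext *toy_switch_to(ToyContext *cxt)
{
    ToyContext *old = toy_current;
    toy_current = cxt;
    return old;
}

/* Allocate in the current context, palloc-style. */
void *toy_alloc(size_t size)
{
    ToyChunk *chunk;
    assert(toy_current != NULL);
    chunk = malloc(sizeof(ToyChunk) + size);
    assert(chunk != NULL);
    chunk->size = size;
    chunk->next = toy_current->chunks;
    toy_current->chunks = chunk;
    toy_current->nchunks++;
    return chunk + 1;
}
```

A caller would typically write old = toy_switch_to(cxt), do its work, and
then call toy_switch_to(old) on the way out.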

Note: there is no really good reason for pfree() to be tied to the current
memory context; it ought to be possible to pfree() a chunk of memory no
matter which context it was allocated from. Currently we cannot do that
because of the possibility that there is more than one kind of memory
context. If they were all AllocSets then the problem goes away, which is
one reason I'd like to eliminate the provision for other kinds of
contexts.
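
One way the single-context-type idea could make pfree() independent of the
current context is sketched below. This is an assumption about a possible
design, not a description of existing code: each chunk carries a small
header naming its owning context, so the free routine needs only the chunk
pointer. OwnerContext, OwnedChunk, owned_alloc and owned_free are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct OwnerContext {
    size_t live_chunks;     /* chunks currently allocated here */
} OwnerContext;

/* Hypothetical header placed in front of every chunk. Because all
 * contexts share this one layout, the header can be trusted no
 * matter which context produced the chunk. */
typedef struct OwnedChunk {
    OwnerContext *owner;
    size_t size;
} OwnedChunk;

void *owned_alloc(OwnerContext *cxt, size_t size)
{
    OwnedChunk *chunk = malloc(sizeof(OwnedChunk) + size);
    assert(chunk != NULL);
    chunk->owner = cxt;
    chunk->size = size;
    cxt->live_chunks++;
    return chunk + 1;       /* payload starts right after the header */
}

/* Free a chunk given only its pointer: step back to the header and
 * let it tell us which context to update. */
void owned_free(void *ptr)
{
    OwnedChunk *chunk = (OwnedChunk *) ptr - 1;
    chunk->owner->live_chunks--;
    free(chunk);
}
```

With more than one kind of context the header layout could differ per
kind, which is exactly the case where this lookup would stop being safe.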

The main advantage of memory contexts over plain use of malloc/free is
that the entire contents of a memory context can be freed easily, without
having to request freeing of each individual chunk within it. This is
both faster and more reliable than per-chunk bookkeeping. We already use
this fact to clean up at transaction end: by resetting all the active
contexts, we reclaim all memory. What we need are additional contexts
that can be reset or deleted at strategic times within a query, such as
after each tuple.

Additions to the memory-context mechanism
-----------------------------------------

If we are going to have more contexts, we need more mechanism for keeping
track of them; else we risk leaking whole contexts under error conditions.
We can do this as follows:

1. There will be two kinds of contexts, "permanent" and "temporary".
Permanent contexts are never reset or deleted except by explicit caller
command (in practice, they probably won't ever be, period). There will
not be very many of these --- perhaps only the existing TopMemoryContext
and CacheMemoryContext. We should avoid having very much code run with
CurrentMemoryContext pointing at a permanent context, since any forgotten
palloc() represents a permanent memory leak.

2. Temporary contexts are remembered by the context manager and are
guaranteed to be deleted at transaction end. (If we ever have nested
transactions, we'd probably want to tie each temporary context to a
particular transaction, but for now that's not necessary.) Most activity
will happen in temporary contexts.

3. When a context is created, an existing context can be specified as its
parent; thus a tree of contexts is created. Resetting or deleting any
particular context resets or deletes all its direct and indirect children
as well. This feature allows us to manage a lot of contexts without fear
that some will be leaked; we just have to make sure everything descends
from one context that we remember to zap at transaction end.

In practice, point #2 doesn't require any special support in the context
manager as long as it supports point #3. We simply start a new context
for each transaction and delete it at transaction end. All temporary
contexts created within the transaction must be direct or indirect
children of this "transaction top context".
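
The parent/child rule in point #3 can be sketched as follows. The names
(TreeContext, tree_create, tree_reset, tree_delete) are hypothetical, and a
plain counter stands in for the memory each context holds; the point is
only that reset and delete recurse through all descendants, so deleting
the transaction top context cannot leak a sub-context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct TreeContext {
    struct TreeContext *parent;
    struct TreeContext *first_child;
    struct TreeContext *next_sibling;
    int live_chunks;        /* stand-in for memory held by the context */
} TreeContext;

int tree_live_contexts = 0; /* for demonstration: contexts not yet deleted */

TreeContext *tree_create(TreeContext *parent)
{
    TreeContext *cxt = malloc(sizeof(TreeContext));
    assert(cxt != NULL);
    cxt->parent = parent;
    cxt->first_child = NULL;
    cxt->next_sibling = NULL;
    cxt->live_chunks = 0;
    if (parent != NULL) {
        cxt->next_sibling = parent->first_child;
        parent->first_child = cxt;
    }
    tree_live_contexts++;
    return cxt;
}

/* Reset this context and, recursively, all of its descendants. */
void tree_reset(TreeContext *cxt)
{
    TreeContext *child;
    cxt->live_chunks = 0;
    for (child = cxt->first_child; child != NULL; child = child->next_sibling)
        tree_reset(child);
}

/* Delete this context and, recursively, all of its descendants. */
void tree_delete(TreeContext *cxt)
{
    TreeContext **link;

    /* Each recursive call unlinks the child, so first_child advances. */
    while (cxt->first_child != NULL)
        tree_delete(cxt->first_child);

    if (cxt->parent != NULL) {
        for (link = &cxt->parent->first_child; *link != NULL;
             link = &(*link)->next_sibling) {
            if (*link == cxt) {
                *link = cxt->next_sibling;
                break;
            }
        }
    }
    free(cxt);
    tree_live_contexts--;
}
```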

Note: it would probably be possible to adapt the existing "portal" memory
management mechanism to do what we need. I am instead proposing setting
up a totally new mechanism, because the portal code strikes me as
extremely crufty and unwieldy. It may be that we can eventually remove
portals entirely, or perhaps reimplement them with this mechanism
underneath.

Top-level (permanent) memory contexts
-------------------------------------

We currently have TopMemoryContext and CacheMemoryContext as permanent
memory contexts. The existing usages of these are probably OK, although
it might be a good idea to examine usages of TopMemoryContext to see if
they should go somewhere else.

It might also be a good idea to set up a permanent ErrorMemoryContext that
elog() can switch into for processing an error; this would ensure that
there is at least ~8K of memory available for error processing, even if
we've run out otherwise. (ErrorMemoryContext could be reset, but not
deleted, after each successful error recovery.)

We will also create a global variable TransactionTopMemoryContext, which
is valid at all times. Memory recovery at end of transaction is done by
deleting and immediately recreating this context. All transaction-local
contexts are created as children of TransactionTopMemoryContext, so that
they go away at transaction end too. (If we implement nested
transactions, it could be that TransactionTopMemoryContext will itself be
a child of some outer transaction's top context, but that's beyond the
scope of this proposal.)

Transaction-local memory contexts
---------------------------------

Relatively little stuff should get allocated directly in
TransactionTopMemoryContext; the bulk of the action should happen in
sub-contexts. I propose the following:

QueryTopMemoryContext: this child of TransactionTopMemoryContext is
created at the start of each query cycle and deleted upon successful
completion. (On error, of course, it goes away because it is a child of
TransactionTopMemoryContext.) The query input buffer is allocated in this
context, as well as anything else that should live just till end of query.

ParsePlanMemoryContext: this child of QueryTopMemoryContext is working
space for the parse/rewrite/plan/optimize pipeline. After completion
of planning, the final query plan is copied via copyObject() back into
QueryTopMemoryContext, and then the ParsePlanMemoryContext can be deleted.
This allows us to recycle the (perhaps large) amount of memory used by
planning before actual query execution starts.

Execution per-run memory contexts: at startup, the executor will create a
child of QueryTopMemoryContext to hold data that should live until
ExecEndPlan; an example is the plan-node-local execution state. Some plan
node types may want to create shorter-lived contexts that are children of
their parent's per-run context. For example, a subplan node would create
its own "per run" context so that memory could be freed at completion of
each invocation of the subplan.

Execution per-tuple memory contexts: each per-run context will have a
child context that the executor will reset (not delete) each time through
the node's per-tuple loop. This per-tuple context will be the active
CurrentMemoryContext most of the time during execution.

By resetting the per-tuple context, we will be able to free memory after
each tuple is processed, rather than only after the whole plan is
processed. This should solve our memory leakage problems pretty well;
yet we do not need to add very much new bookkeeping logic to do it.
In particular, we do *not* need to try to keep track of individual values
palloc'd during expression evaluation.

Note we assume that resetting a context is a cheap operation. This is
true already, and we can make it even more true with a little bit of
tuning in aset.c.
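
As an illustration of why a per-tuple reset keeps memory use flat, here is
a small sketch. It uses a bump allocator over one fixed block rather than
an AllocSet, so that reset is a single assignment; that choice, the lack of
growth, and all the names (Arena, arena_alloc, arena_reset,
run_per_tuple_loop) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct Arena {
    unsigned char *buf;     /* one malloc'd block, reused across resets */
    size_t size;
    size_t used;
    size_t peak;            /* high-water mark, for demonstration only */
} Arena;

int arena_init(Arena *a, size_t size)
{
    a->buf = malloc(size);
    a->size = size;
    a->used = 0;
    a->peak = 0;
    return a->buf != NULL;
}

void *arena_alloc(Arena *a, size_t n)
{
    void *p;
    n = (n + 15) & ~(size_t) 15;    /* keep every chunk 16-byte aligned */
    if (n > a->size - a->used)
        return NULL;                /* sketch: the arena never grows */
    p = a->buf + a->used;
    a->used += n;
    if (a->used > a->peak)
        a->peak = a->used;
    return p;
}

/* Resetting is O(1): there are no individual chunks to walk. */
void arena_reset(Arena *a)
{
    a->used = 0;
}

void arena_free(Arena *a)
{
    free(a->buf);
    a->buf = NULL;
}

/* Process ntuples "tuples", each needing some scratch space, and
 * return the peak number of bytes the per-tuple arena ever held. */
size_t run_per_tuple_loop(Arena *per_tuple, int ntuples, size_t scratch)
{
    int i;
    for (i = 0; i < ntuples; i++) {
        char *tmp;
        arena_reset(per_tuple);     /* drop the previous tuple's garbage */
        tmp = arena_alloc(per_tuple, scratch);
        assert(tmp != NULL);
        memset(tmp, 'x', scratch);  /* stand-in for expression evaluation */
    }
    return per_tuple->peak;
}
```

Nothing in the loop frees tmp individually; the reset at the top of the
next iteration reclaims it, so the peak stays at one tuple's worth of
scratch space however many tuples are processed.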

Coding rules required
---------------------

Functions that return pass-by-reference values will be required always
to palloc the returned space in the caller's memory context (i.e., the
context that was CurrentMemoryContext at the time of call). It is not
OK to pass back an input pointer, even if we are returning an input value
verbatim, because we do not know the lifespan of the context the input
pointer points to. An example showing why this is necessary is provided
by aggregate-function execution. The aggregate function executor must
retain state values returned by state-transition functions from one tuple
to the next. Yet it does not want to keep them till end of run; that
would be a memory leak. The solution nodeAgg.c will use is to have two
per-tuple memory contexts that are used alternately. At each tuple,
an old state value existing in one context is passed to the state
transition function, which will return its result in the other context
(since that'll be where CurrentMemoryContext points). Then the first
context is reset and used as the target for the next cycle. This solution
works as long as the transition function always returns a newly palloc'd
datum, and never simply returns a pointer to its input data.
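
The two-context alternation described above can be sketched like this. It
is only an illustration, not nodeAgg.c: the "contexts" are tiny fixed pools
of long slots, and SlotContext, slot_alloc, sum_transition and
aggregate_sum are hypothetical names. The transition function follows the
proposed rule of always returning a freshly allocated result in the current
context.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 8

/* Hypothetical miniature context: a fixed pool of slots, reset in O(1). */
typedef struct SlotContext {
    long slots[SLOTS];
    size_t used;
} SlotContext;

/* Plays the role of CurrentMemoryContext. */
SlotContext *slot_current = NULL;

long *slot_alloc(void)
{
    assert(slot_current != NULL && slot_current->used < SLOTS);
    return &slot_current->slots[slot_current->used++];
}

void slot_reset(SlotContext *cxt)
{
    cxt->used = 0;
}

/* State-transition function: the result is newly allocated in the
 * current context and is never a pointer to the input state. */
long *sum_transition(const long *state, long value)
{
    long *result = slot_alloc();
    *result = *state + value;
    return result;
}

long aggregate_sum(const long *values, size_t n)
{
    SlotContext cxts[2] = { { {0}, 0 }, { {0}, 0 } };
    int cur = 0;            /* index of the context holding the state */
    long *state;
    size_t i;

    slot_current = &cxts[cur];
    state = slot_alloc();
    *state = 0;

    for (i = 0; i < n; i++) {
        int other = 1 - cur;
        slot_current = &cxts[other];    /* new state lands in the other context */
        state = sum_transition(state, values[i]);
        slot_reset(&cxts[cur]);         /* the old state is now garbage */
        cur = other;
    }
    slot_current = NULL;
    return *state;
}
```

If sum_transition returned its input pointer instead of a new slot, the
reset on the following line would invalidate the state it had just handed
back, which is the failure the coding rule is meant to prevent.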

Thus, a function must use the passed-in CurrentMemoryContext for
allocating its result data, and can use it for any temporary storage it
needs as well. pfree'ing such temporary data before return is possible
but not essential.

Executor routines that switch the active CurrentMemoryContext may need
to copy data into their caller's current memory context before returning.
I think there will be relatively little need for that, if we use a
convention of resetting the per-tuple context at the *start* of an
execution cycle rather than at its end. With that rule, an execution
node can return a tuple that is palloc'd in its per-tuple context, and
the tuple will remain good until the node is called for another tuple
or told to end execution. This is pretty much the same state of affairs
that exists now, since a scan node can return a direct pointer to a tuple
in a disk buffer that is only guaranteed to remain good that long.
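
A small sketch of the reset-at-start convention, under the assumption that
a node's per-tuple context holds just one allocation (represented here by a
single malloc'd buffer). ScanNode, scan_next and scan_end are hypothetical
names, not executor routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct ScanNode {
    char *per_tuple;    /* stand-in for the node's per-tuple context */
    int next_row;
    int nrows;
} ScanNode;

/* Return the next "tuple", or NULL when the scan is exhausted. The
 * result stays valid until the next call for this node or until
 * scan_end(), because the reset happens at the start of the cycle. */
const char *scan_next(ScanNode *node)
{
    free(node->per_tuple);          /* "reset" the per-tuple context */
    node->per_tuple = NULL;

    if (node->next_row >= node->nrows)
        return NULL;

    node->per_tuple = malloc(32);
    assert(node->per_tuple != NULL);
    snprintf(node->per_tuple, 32, "row %d", node->next_row);
    node->next_row++;
    return node->per_tuple;         /* no copy into the caller's context */
}

/* Told to end execution: release whatever the last cycle left behind. */
void scan_end(ScanNode *node)
{
    free(node->per_tuple);
    node->per_tuple = NULL;
}
```

Had the reset been placed at the end of scan_next(), the node would have
had to copy the tuple into the caller's context before returning it.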

A more common reason for copying data will be to transfer a result from
per-tuple context to per-run context; for example, a Unique node will
save the last distinct tuple value in its per-run context, requiring a
copy step. (Actually, Unique could use the same trick with two per-tuple
contexts as described above for Agg, but there will probably be other
cases where doing an extra copy step is the right thing.)

Other notes
-----------

It might be that the executor per-run contexts described above should
be tied directly to executor "EState" nodes, that is, one context per
EState. I'm not real clear on the lifespan of EStates or the situations
where we have just one or more than one, so I'm not sure. Comments?

With so many contexts running around, I think it will be almost essential
to allow pfree() to work on chunks belonging to contexts other than the
current one. If we don't get rid of the notion of multiple allocation
context types then some other work will have to be expended to make this
possible. Also, should we allow repalloc() to work on a chunk not
belonging to the current context? I'm less excited about allowing that,
but it may prove useful.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Tom Lane

tgl@sss.pgh.pa.us

almost 26 years ago

In reply to: Bruce Momjian (#1)

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> FYI, Tom, is this still relevant?

Yes, and I'll probably do something about it as soon as I can come up
for air from the fmgr fixes. Man, we have got a *lot* of built-in
functions.

regards, tom lane