memory leak in e94568ecc10 (pre-reading in external sort)
Hi,
it seems e94568ecc10 has a pretty bad memory leak. A simple
pgbench -i -s 300
allocates ~32GB of memory before it fails
vacuum...
set primary keys...
ERROR: out of memory
DETAIL: Failed on request of size 134184960.
The relevant bit from the memory context stats:
TupleSort main: 33278738504 total in 263 blocks; 78848 free (23 chunks);
33278659656 used
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10/06/2016 07:50 AM, Tomas Vondra wrote:
it seems e94568ecc10 has a pretty bad memory leak. A simple
Oops, fixed, thanks for the report!
To be precise, this wasn't a memory leak, just a gross overallocation of
memory. The new code in tuplesort.c assumes that it's harmless to call
LogicalTapeRewind() on never-used tapes, but in fact, it allocated the
read buffer for the tape. I fixed that by changing LogicalTapeRewind()
to not allocate it, if the tape was completely empty.
This is related to earlier the discussion with Peter G, on whether we
should change state->maxTapes to reflect the actual number of tape that
were used, when that's less than maxTapes. I think his confusion about
the loop in init_tape_buffers(), in
CAM3SWZQ7=FCy1iUZ6jNzUUNnktAG6uitC1i-DoNxScP-9ZsHwQ@mail.gmail.com would
also have been avoided, if we had done that. So I think we should
reconsider that.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 6, 2016 at 12:00 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This is related to earlier the discussion with Peter G, on whether we should
change state->maxTapes to reflect the actual number of tape that were used,
when that's less than maxTapes. I think his confusion about the loop in
init_tape_buffers(), in
CAM3SWZQ7=FCy1iUZ6jNzUUNnktAG6uitC1i-DoNxScP-9ZsHwQ@mail.gmail.com would
also have been avoided, if we had done that. So I think we should reconsider
that.
-1 on that from me. I don't think that you should modify a variable
that is directly linkable to Knuth's description of polyphase merge --
doing so seems like a bad idea. state->maxTapes (Knuth's T) really is
something that is established pretty early on, and doesn't change.
While the fix you pushed was probably a good idea anyway, I still
think you should not use state->maxTapes to exhaustively call
LogicalTapeAssignReadBufferSize() on every tape, even non-active
tapes. That's the confusing part. It's not as if your need for the
number of input tapes (the number of maybe-active tapes) is long
lived; you just need to instruct logtape.c on buffer sizing once, at
the start of mergeruns().
Besides, what I propose to do is really exactly the same as what you
also want to do, except it avoids actually changing state->maxTapes.
We'd just pass down what you propose to assign to state->maxTapes
directly, which differs (and not just in the common case where there
are inactive tapes -- it's always at least off-by-one). Right?
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 6, 2016 at 8:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
Besides, what I propose to do is really exactly the same as what you
also want to do, except it avoids actually changing state->maxTapes.
We'd just pass down what you propose to assign to state->maxTapes
directly, which differs (and not just in the common case where there
are inactive tapes -- it's always at least off-by-one). Right?
What I mean is that you should pass down numTapes alongside
numInputTapes. The function init_tape_buffers() could either have an
additional argument (numTapes), or derive numTapes itself.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10/06/2016 06:44 PM, Peter Geoghegan wrote:
While the fix you pushed was probably a good idea anyway, I still
think you should not use swhichtate->maxTapes to exhaustively call
LogicalTapeAssignReadBufferSize() on every tape, even non-active
tapes. That's the confusing part.
Admittedly that's confusing. Thinking about this some more, I came up
with the attached. I removed the separate
LogicalTapeAssignReadBufferSize() call altogether - the read buffer size
is now passed as argument to the LogicalTapeRewindForRead() call.
I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and
LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is
mentally challenging, because when reading a call site, you have to
remember which way round the boolean is. And now you also pass the extra
buffer_size argument when rewinding for reading, but not when writing.
I gave up on the logic to calculate the quotient and reminder of the
available memory, and assigning one extra buffer to some of the tapes.
I'm now just rounding it down to the nearest BLCKSZ boundary. That
simplifies the code further, although we're now not using all the memory
we could. I'm pretty sure that's OK, although I haven't done any
performance testing of this.
- Heikki
Attachments:
0001-Simplify-the-code-for-logical-tape-read-buffers.patchtext/x-patch; name=0001-Simplify-the-code-for-logical-tape-read-buffers.patchDownload+137-202
On Mon, Oct 10, 2016 at 5:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Admittedly that's confusing. Thinking about this some more, I came up with
the attached. I removed the separate LogicalTapeAssignReadBufferSize() call
altogether - the read buffer size is now passed as argument to the
LogicalTapeRewindForRead() call.I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and
LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is mentally
challenging, because when reading a call site, you have to remember which
way round the boolean is. And now you also pass the extra buffer_size
argument when rewinding for reading, but not when writing.
I like the general idea here.
I gave up on the logic to calculate the quotient and reminder of the
available memory, and assigning one extra buffer to some of the tapes. I'm
now just rounding it down to the nearest BLCKSZ boundary. That simplifies
the code further, although we're now not using all the memory we could. I'm
pretty sure that's OK, although I haven't done any performance testing of
this.
The only thing I notice is that this new struct field could use a comment:
--- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -366,6 +366,8 @@ struct Tuplesortstate char *slabMemoryEnd; /* end of slab memory arena */ SlabSlot *slabFreeHead; /* head of free list */+ size_t read_buffer_size;
+
Also, something about trace_sort here:
+#ifdef TRACE_SORT + if (trace_sort) + elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes", + (state->availMem) / 1024, numInputTapes); +#endif + + state->read_buffer_size = state->availMem / numInputTapes; + USEMEM(state, state->availMem);
Maybe you should just make the trace_sort output talk about blocks at
this point?
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10/11/2016 12:56 AM, Peter Geoghegan wrote:
Also, something about trace_sort here:
+#ifdef TRACE_SORT + if (trace_sort) + elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes", + (state->availMem) / 1024, numInputTapes); +#endif + + state->read_buffer_size = state->availMem / numInputTapes; + USEMEM(state, state->availMem);Maybe you should just make the trace_sort output talk about blocks at
this point?
With # of blocks, you then have to mentally divide by 8 to get the
actual memory used. I think kB is nicer in messages that are meant to be
read by humans.
The bigger problem with this message is that it's not very accurate
anymore. The actual amount of memory used is rounded down, and capped by
MaxAllocSize*numInputTapes. And would it be better to print the per-tape
buffer size, rather than the total?
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers