Memory usage during sorting
tuplesort.c initially reads tuples into memory until availMem is
exhausted.
It then switches to the tape sort algorithm and allocates buffer
space for each tape it will use. This substantially overruns
allowedMem and drives availMem negative.
It works off this deficit by writing tuples to tape and pfree-ing
their spots in the heap; the pfree'd space is more or less randomly
scattered. Once this is done, it proceeds to alternate between
freeing more space in the heap and adding things to the heap (in
nearly strict 1:1 alternation if the tuples are of constant size).
The space freed up by that initial round of pfrees, while it is
working off the space deficit from inittapes, is never re-used. Nor
can it be paged out by the VM system, because it is scattered among
actively used memory.
I don't think that small chunks can be reused from one memory context
to another, but I haven't checked. Even if they can be, during a big
sort such as an index build, the backend process doing the build may
have no other contexts that need the space.
So, having overrun workMem and stomped all over it to ensure no one
else can reuse it, we then scrupulously refuse to benefit from that
overrun ourselves.
The attached patch allows that memory to be reused. On my meager
system it reduced the time to build an index on an integer column of a
skinny, totally randomly ordered, 200 million row table by about 3%,
from a baseline of 25 minutes.
I think it would be better to pre-deduct from availMem the tape
overhead we will need if we decide to switch to tape sort, before we
even start reading (and then add it back if we do indeed make that
switch). That way we wouldn't overrun the memory in the first place.
However, that would cause apparent regressions in which sorts that
previously fit into maintenance_work_mem no longer do. Boosting
maintenance_work_mem to a level that was actually being used
previously would fix those regressions, but pointing out that the
previous behavior was not optimal doesn't change the fact that people
are used to it and perhaps tuned to it. So the attached patch seems
more backwards-friendly.
Cheers,
Jeff
Attachment: sort_mem_usage_v1.patch (application/octet-stream)
On 16 January 2012 00:59, Jeff Janes <jeff.janes@gmail.com> wrote:
However, that would cause apparent regressions in which sorts that
previously fit into maintenance_work_mem no longer do. [...] So the
attached patch seems more backwards-friendly.
Hmm. Are people really setting maintenance_work_mem such that it is
exactly large enough to quicksort when building an index in one case
but not another? Is the difference large enough to warrant avoiding
pre-deduction?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Jan 21, 2012 at 4:51 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Hmm. Are people really setting maintenance_work_mem such that it is
exactly large enough to quicksort when building an index in one case
but not another?
It could also apply to work_mem in other situations. I'm just using
maintenance_work_mem in my example because index creation is the thing
that led me into this little diversion, and so that is what my
test bed is set up to use.
Some people do have very stable, "ritualized" workloads and so could
have tuned exactly to them; I don't know how common that would be.
More common might be synthetic benchmarks that had been tuned,
intentionally or accidentally, to just barely fit in the (overshot)
memory, so there would be a perceived regression when in fact the
tuning had merely become out of date.
Is the difference large enough to warrant avoiding
pre-deduction?
Switching to an external sort when you could have done it all by
quicksort (by cheating just a little bit on memory usage) makes a
pretty big difference, over two-fold in my tests. If the fast-path
optimization for qsort goes in, the difference might get even bigger.
Of course, there will always be some point at which that switch-over
must occur, so the real question is: do we need to keep the switch
point historically consistent to avoid nasty surprises for people with
excessively fine-tuned *work_mem settings based on old behavior? And
do such people even exist outside of my imagination?
But a bigger question I now have is whether any of this matters. Since
there is currently no way to know how many connections might be
running how many concurrent sorts, there is really no way to tune
work_mem with much precision. So, does it matter if we slightly lie
about how much we will actually use? And is it worth worrying about
whether every last drop of memory is used efficiently?
I was trying to compare the performance I was getting with a
theoretical model of what I should get, and it was confusing to me
that we were using an originally defined amount of memory for the
priority heap and then later reducing that amount. That is annoying
because it complicates the theoretical model, and even more annoying
when I realized that the "freed" memory that results from doing this
is too fragmented to even be used for other purposes. But theoretical
annoyances aside, it is hardly the worst thing about memory usage in
sorting; it is just the one standing in the way of my further study.
So I don't think this patch I proposed is particularly important in
its own right. I want to see if anyone will point out that I am all
wet because my analysis failed to take <foo> into account. I
probably should have explained this when I submitted the patch.
The worst thing about the current memory usage is probably that big
machines can't qsort more than 16,777,216 tuples no matter how much
memory they have, because memtuples has to live entirely within a
single allocation, which is currently limited to 1GB. It is probably
too late to fix this problem for 9.2. I wish I had started there
rather than here.
This 16,777,216 tuple limitation will get even more unfortunate if the
specializations that speed up qsort but not external sort get
accepted.
Attached is a completely uncommittable proof-of-concept /
work-in-progress patch to get around the limitation. It shows a
two-fold improvement when indexing an integer column on a 50,000,000
row randomly ordered table.
As for what to do to make it committable, it seems like maybe there
should be two different MaxAllocSizes: one of them determined by the
fact that aset.c assumes it can compute twice an allocation's size
without overflow. I would think that simply testing size_t at compile
time would be enough. Allowing allocations of more than 1GB would
probably be pointless on 32-bit machines, while 64-bit machines can
compute twice a value much larger than 1GB without overflow.
For the varlena stuff, I don't know what to do. Is the "I swear to
God I won't pass this pointer to one of those functions" sufficient?
Or does each function need to test it?
And I'm pretty sure that palloc, not just repalloc, would need an
over-size alternative to make this a general-purpose improvement.
Cheers,
Jeff
Attachment: memtuples_POCWIP.patch (application/octet-stream)
On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Attached is a completely uncommitable proof of concept/work in
progress patch to get around the limitation. It shows a 2 fold
improvement when indexing an integer column on a 50,000,000 row
randomly ordered table.
Oops, forgot to say that is with
set maintenance_work_mem=4194304;
On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
The attached patch allows it to reuse that memory. On my meager
system it reduced the building of an index on an integer column in a
skinny 200 million row totally randomly ordered table by about 3% from
a baseline of 25 minutes.
Just to give a standard review: this patch is a one-line change,
applies cleanly, and builds OK.
I'm not quite sure what exactly you're trying to accomplish, but it
seems to me that it's avoiding the first dumptuples cycle by forcing
availMem = 0 even if it's negative. I read your comments as saying it
avoids alternating between reading and writing back and forth with
scattered memory, defeating the memory cache, during the merge phase;
but actually it doesn't affect the merge phase, only the init-dump
phase, does it? If so, I'm not convinced your benchmark gained 3%
from this change. Correct me, as I'm probably wrong.
Anyway, it would be nice to update the comment just above the change
so that it stays accurate.
Thanks,
--
Hitoshi Harada
On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
The worst thing about the current memory usage is probably that big
machines can't qsort more than 16,777,216 tuples no matter how much
memory they have, because memtuples has to live entirely within a
single allocation, which is currently limited to 1GB. [...] This
16,777,216 tuple limitation will get even more unfortunate if the
specializations that speed up qsort but not external sort get
accepted.
I think it's a fair ask to extend our 1GB palloc limitation to 64-bit
space. I see a lot of applications that want more memory from a
single palloc call in user-defined functions, aggregates, etc. As you
note, however, the areas of postgres affected need to be investigated
deeply, and I don't know where it's safe to allow it and where it is
not. varlena types are unsafe, but it might be possible to extend the
varlena header to 64 bits in a major release somehow.
Attached is a completely uncommitable proof of concept/work in
progress patch to get around the limitation. It shows a 2 fold
improvement when indexing an integer column on a 50,000,000 row
randomly ordered table.
In any case, we do need a bird's-eye sketch before attacking it, but I
guess it's worthwhile, and at some future point we definitely must do
it, though I don't know if that's the next release or the third one
from now.
Thanks,
--
Hitoshi Harada
On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
I'm not quite sure what exactly you're trying to accomplish, but it
seems to me that it's avoiding the first dumptuples cycle by forcing
availMem = 0 even if it's negative.
Yes. Currently, when it switches to the TSS_BUILDRUNS part of a
tape-sort, it starts by calling WRITETUP a large number of times
consecutively, to work off the memory deficit incurred by the three
blocks per tape of tape overhead, and after that it calls WRITETUP
about once per puttuple. Under my patch, it would only call WRITETUP
about once per puttuple, right from the beginning.
I read your comments as saying it avoids alternating between reading
and writing back and forth with scattered memory, defeating the
memory cache, during the merge phase; but actually it doesn't affect
the merge phase, only the init-dump phase, does it?
It affects the building of the runs. But this building of the runs
is not a simple dump; it is itself a mini merge phase, in which it
merges the existing in-memory priority queue against the
still-incoming tuples from the node that invoked the sort. By using
less memory than it could, the resulting runs are smaller than they
could be, and so will sometimes necessitate an additional layer of
merging later on. (This effect is particularly large for the very
first run being built. Generally, by merging incoming tuples into the
in-memory tuples, you can create runs that are about 1.7 times the
size of what fits in memory. By wasting some memory, we get 1.7 times
the size of a smaller starting point. But for the first run, it is
worse than that: most of the benefit that leads to the 1.7 multiplier
comes at the very early stage of each run build, and by initially
using the full memory, then writing out a bunch of tuples without
doing any merging of the incoming ones, we have truncated the part
that gives the most benefit.)
My analysis that the "freed" memory is never reused (because we
refuse to reuse it ourselves and it is too fragmented to be reused by
anyone else, like the palloc or VM system) only applies to the
run-building phase, so "never" was a bit of an overstatement. By the
time the last initial run is completely written out to tape, the heap
used for the priority queue should be totally empty, so at that point
the allocator would have the chance to congeal all of the fragmented
memory back into larger chunks, or perhaps to parcel the allocations
back out again in an order such that the unused space is contiguous
and could be meaningfully paged out.

But is it worth worrying about how much we fragment memory, and
whether we overshoot our promises by 10 or 20%?
If so, I'm not convinced your benchmark gained 3% from this change.
Correct me, as I'm probably wrong.
I've now done more complete testing: building an index on a
200,000,000 row table with an integer column populated in random order
with integers from 1..500,000,000, non-unique, on a machine with 2GB
of RAM and 600MB of shared_buffers.

The patch improves things by 6-7 percent at the low end of the tested
work-mem sizes; the rest are in the noise, except at 77936 KB, where
it reproducibly makes things 4% worse, for reasons I haven't figured
out. So maybe the best thing to do is, rather than micromanaging
memory usage, simply not set maintenance_work_mem way too low. (But
that is the default.)
maintenance_work_mem (KB)    % change with patch
  16384                       6.2
  19484                       7.8
  23170                      -0.1
  27554                       0.1
  32768                       1.9
  38968                      -1.9
  46341                       0.4
  55109                       0.4
  65536                       0.6
  77936                      -4.3
  92682                      -0.3
 110218                       0.1
 131072                       1.7
Anyway, it would be nice to update the comment just above the change
so that it stays accurate.
Much of that comment refers to other things (some of which I don't
understand). How about an additional comment just before my code, to
the effect of:

/*
 * If we have already used, and thus fragmented, the memory, then we
 * might as well continue to make use of it, as no one else can.
 */

(That might not actually be true if the tuples we are sorting are very
large, or perhaps if they arrive in already reverse-sorted order.)
Thanks,
Jeff
On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
It improves things by 6-7 percent at the low end of the tested
work-mem sizes; the rest are in the noise, except at 77936 KB, where
it reproducibly makes things 4% worse, for reasons I haven't figured
out. So maybe the best thing to do is, rather than micromanaging
memory usage, simply not set maintenance_work_mem way too low. (But
that is the default.)
I've tested here with only a million rows of mixed integer/text
(table size is 80MB or so) at the default settings, running CREATE
INDEX continuously, but couldn't find a performance improvement. The
numbers vary from -2% to +2%, which I think is just noise.

While I agree with your point that avoiding the first dump could make
sense, I guess it depends on the situation: if the dump goes out with
lots of tuples (which should happen when availMem is big), writing
many tuples at a time will be faster than writing little by little
later.
I'm not sure about the conclusion, but given this discussion, I'm
inclined to mark this Returned with Feedback.
Thanks,
--
Hitoshi Harada
On Tue, Feb 14, 2012 at 1:44 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
I'm not sure about the conclusion, but given this discussion, I'm
inclined to mark this Returned with Feedback.
OK, thanks. Does anyone have additional feedback on how tightly we
wish to manage memory usage? Is trying to make us use as much memory
as we are allowed, without going over, a worthwhile endeavor at all,
or is it just academic nitpicking?
Also, since the default value of work_mem is quite low, should the
docs be more aggressive in suggesting that people on any realistically
modern hardware tune it upwards? If you have a few thousand large
sorts running concurrently, having all of them spill to disk
intentionally is probably going to destroy your server's performance
about as much as having them spill to disk via swapping would. If you
find yourself in that situation, the answer is probably a connection
pooler with admission control, not a work_mem of 1MB.
Cheers,
Jeff
On Sat, Feb 25, 2012 at 4:31 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
OK, thanks. Does anyone have additional feedback on how tightly we
wish to manage memory usage? Is trying to make us use as much memory
as we are allowed, without going over, a worthwhile endeavor at all,
or is it just academic nitpicking?
I'm not sure, either. It strikes me that, in general, it's hard to
avoid a little bit of work_mem overrun, since we can't know whether
the next tuple will fit until we've read it, and then if it turns out
to be big, well, the best thing we can do is free it, but perhaps
that's closing the barn door after the horse has gotten out already.
Having recently spent quite a bit of time looking at tuplesort.c as a
result of Peter Geoghegan's work and some other concerns, I'm inclined
to think that it needs more than minor surgery. That file is peppered
with numerous references to Knuth which serve the dual function of
impressing the reader with the idea that the code must be very good
(since Knuth is a smart guy) and rendering it almost completely
impenetrable (since the design is justified by reference to a textbook
most of us probably do not have copies of).
A quick Google search for external sorting algorithms suggests that
the typical way of doing an external sort is to read data until you
fill your in-memory buffer, quicksort it, and dump it out as a run.
Repeat until end-of-data; then merge the runs (either in a single
pass or, if there are too many, in multiple passes).
would be better than what we're doing now, but there seem to be enough
other people doing it that we might want to try it out. Our current
algorithm is to build a heap, bounded in size by work_mem, and dribble
tuples in and out, but maintaining that heap is pretty expensive;
there's a reason people use quicksort rather than heapsort for
in-memory sorting. As a desirable side effect, I think it would mean
that we could dispense with retail palloc and pfree altogether. We
could just allocate a big chunk of memory, copy tuples into it until
it's full, using a pointer to keep track of the next unused byte, and
then, after writing the run, reset the allocation pointer back to the
beginning of the buffer. That would not only avoid the cost of going
through palloc/pfree, but also the memory overhead imposed by
bookkeeping and power-of-two rounding.
If we do want to stick with the current algorithm, there seem to be
some newer techniques for cutting down on the heap maintenance
overhead. Heikki's been investigating that a bit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Our current algorithm is to build a heap, bounded in size by work_mem,
and dribble tuples in and out, but maintaining that heap is pretty
expensive; there's a reason people use quicksort rather than heapsort
for in-memory sorting.
Well, the reason the heapsort approach is desirable here is that you end
up with about half as many runs to merge, because the typical run length
is significantly more than what will fit in work_mem. Don't get too
excited about micro-optimization if you haven't understood the larger
context.
regards, tom lane
On Sun, Feb 26, 2012 at 7:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not sure, either. It strikes me that, in general, it's hard to
avoid a little bit of work_mem overrun, since we can't know whether
the next tuple will fit until we've read it, and then if it turns out
to be big, well, the best thing we can do is free it, but perhaps
that's closing the barn door after the horse has gotten out already.
That type of overrun doesn't bother me much, because the size of the
next tuple someone else feeds us is mostly outside this module's
control, and because the overrun memory should be reusable once the
offending tuple is written out to tape. The type of overrun I'm more
concerned with is the kind that is purely under this module's control,
and which is then not reused productively.
The better solution would be to reduce the overhead in the first
place. While building the initial runs, there is no reason to have 3
blocks worth of overhead for each tape, when only one tape is ever
being used at a time. But that change seems much tougher to
implement.
That file is peppered with numerous references to Knuth which serve
the dual function of impressing the reader with the idea that the
code must be very good (since Knuth is a smart guy) and rendering it
almost completely impenetrable (since the design is justified by
reference to a textbook most of us probably do not have copies of).
Yes, I agree with that analysis. And getting the volume I want by
inter-library-loan has been challenging--I keep getting the volume
before or after the one I want. Maybe Knuth starts counting volumes
at 0 and libraries start counting at 1 :)
Anyway, I think the logtape could use redoing. When your tapes are
actually physically tape drives, it is necessary to build up runs one
after the other on physical tapes, because un-mounting a tape from a
tape drive and remounting another tape is not very feasible at scale.
Which then means you need to place your runs carefully, because if the
final merge finds that two runs it needs are back-to-back on one tape,
that is bad. But with disks pretending to be tapes, you could
re-arrange the "runs" with just some bookkeeping. Maintaining the
distinction between "tapes" and "runs" is pointless, which means the
Fibonacci placement algorithm is pointless as well.
A quick Google search for external sorting algorithms suggest that the
typical way of doing an external sort is to read data until you fill
your in-memory buffer, quicksort it, and dump it out as a run. Repeat
until end-of-data; then, merge the runs (either in a single pass, or
if there are too many, in multiple passes). I'm not sure whether that
would be better than what we're doing now, but there seem to be enough
other people doing it that we might want to try it out. Our current
algorithm is to build a heap, bounded in size by work_mem, and dribble
tuples in and out, but maintaining that heap is pretty expensive;
there's a reason people use quicksort rather than heapsort for
in-memory sorting.
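The standard approach described here can be sketched in a few lines; this is an illustrative model in Python, with in-memory lists standing in for tape files and `run_size` standing in for work_mem — none of it is from tuplesort.c:

```python
import heapq

def external_sort(items, run_size):
    """Sort an iterable using bounded memory: form sorted runs of at
    most run_size items with an in-memory sort, then k-way merge them."""
    runs = []
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) == run_size:      # "work_mem" is full: dump a run
            buf.sort()                # quicksort the buffer
            runs.append(buf)          # stands in for writing a run to tape
            buf = []
    if buf:
        buf.sort()
        runs.append(buf)
    # Merge all runs in a single pass (assumes one pass suffices).
    return list(heapq.merge(*runs))

print(external_sort([5, 3, 8, 1, 9, 2, 7], run_size=3))  # → [1, 2, 3, 5, 7, 8, 9]
```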
But it would mean we have about 1.7x more runs that need to be merged
(for initially random data). Considering the minimum merge order is
6, that increase in runs is likely not to lead to an additional level
of merging, in which case the extra speed of building the runs would
definitely win. But if it does cause an additional level of merge, it
could end up being a loss.
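The run-length behaviour behind figures like 1.7x can be checked with a quick simulation of replacement selection, the heap-based run formation under discussion. This is an illustrative sketch, not the tuplesort.c code; it reproduces the classical result that runs average about twice the heap size on random input:

```python
import heapq, random

def run_lengths(data, heap_size):
    """Replacement selection: heap entries are (run_number, value); a new
    value smaller than the last one emitted must wait for the next run."""
    it = iter(data)
    heap = []
    for _ in range(heap_size):
        try:
            heap.append((0, next(it)))
        except StopIteration:
            break
    heapq.heapify(heap)
    lengths, count, current_run = [], 0, 0
    while heap:
        run, val = heapq.heappop(heap)
        if run != current_run:          # current run exhausted; start next
            lengths.append(count)
            count, current_run = 0, run
        count += 1
        try:
            nxt = next(it)
            # Joins the current run only if it can still be emitted in order.
            heapq.heappush(heap, (run if nxt >= val else run + 1, nxt))
        except StopIteration:
            pass
    lengths.append(count)
    return lengths

random.seed(0)
lens = run_lengths([random.random() for _ in range(100_000)], 1000)
avg = sum(lens) / len(lens)   # roughly 2 * heap_size for random input
```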
Is there some broad corpus of sorting benchmarks which changes could
be tested against? I usually end up testing just simple columns of
integers or small strings, because they are easy to set up. That is
not ideal.
As a desirable side effect, I think it would mean
that we could dispense with retail palloc and pfree altogether. We
could just allocate a big chunk of memory, copy tuples into it until
it's full, using a pointer to keep track of the next unused byte, and
then, after writing the run, reset the allocation pointer back to the
beginning of the buffer. That would not only avoid the cost of going
through palloc/pfree, but also the memory overhead imposed by
bookkeeping and power-of-two rounding.
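A minimal model of that bump-allocation scheme, with a plain bytearray standing in for the big chunk (illustrative only, not backend code):

```python
class BumpArena:
    """Append-only allocator: copy tuples in at the next free offset,
    remember each tuple's start, and reset wholesale after a run."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self.used = 0
        self.offsets = []           # the pointer array a qsort would use

    def store(self, payload: bytes):
        if self.used + len(payload) > len(self.buf):
            return None             # arena full: time to write the run
        self.offsets.append((self.used, len(payload)))
        self.buf[self.used:self.used + len(payload)] = payload
        self.used += len(payload)   # no per-tuple header, no rounding
        return len(self.offsets) - 1

    def tuples(self):
        return [bytes(self.buf[o:o + n]) for o, n in self.offsets]

    def reset(self):                # "pfree" everything at once
        self.used = 0
        self.offsets = []

arena = BumpArena(64)
arena.store(b"hello")
arena.store(b"world")
```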
Wouldn't we still need an array of pointers to the start of every
tuple's location in the buffer? Or else, how would qsort know where
to find them?
Also, to do this we would need to get around the 1GB allocation limit.
It is bad enough that memtuples is limited to 1GB, it would be much
worse if the entire arena was limited to that amount.
If we do want to stick with the current algorithm, there seem to be
some newer techniques for cutting down on the heap maintenance
overhead. Heikki's been investigating that a bit.
Interesting. Is that investigation around the poor L1/L2 caching
properties of large heaps? I was wondering if there might be a way to
give tuplesort an initial estimate of how much data there was to sort,
so that it could use a smaller amount of memory than the max if it
decided that that would lead to better caching effects. Once you know
you can't do an in-memory sort, then there is no reason to use more
memory than the amount that lets you merge all the runs in one go.
Cheers,
Jeff
On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
The better solution would be to reduce the overhead in the first
place. While building the initial runs, there is no reason to have 3
blocks worth of overhead for each tape, when only one tape is ever
being used at a time. But that change seems much tougher to
implement.
Hmm, yeah.
Anyway, I think the logtape could use redoing. When your tapes are
actually physically tape drives, it is necessary to build up runs one
after the other on physical tapes, because un-mounting a tape from a
tape drive and remounting another tape is not very feasible at scale.
Which then means you need to place your runs carefully, because if the
final merge finds that two runs it needs are back-to-back on one tape,
that is bad. But with disks pretending to be tapes, you could
re-arrange the "runs" with just some bookkeeping. Maintaining the
distinction between "tapes" and "runs" is pointless, which means the
Fibonacci placement algorithm is pointless as well.
I think you're right. It seems to me that we could just write out an
arbitrary number of runs, one per file, ending up with files numbered
1..N. If we can do a final merge of N runs without overrunning
work_mem, fine. If not, we merge the first K runs (for the largest
possible K) and write out a new run N+1. The range of valid run
numbers is now K+1..N+1. If those can be merged in a single pass, we
do so; otherwise we again merge the first K runs (K+1 through 2K) to
create a new run N+2. And so on.
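That merge schedule can be sketched as follows, with in-memory lists standing in for the run files and `max_merge` standing in for the fan-in that work_mem permits (both names are illustrative):

```python
import heapq
from collections import deque

def merge_runs(runs, max_merge):
    """Repeatedly merge the first max_merge runs into a new run appended
    at the end, until everything remaining fits in one final merge."""
    queue = deque(runs)
    while len(queue) > max_merge:
        batch = [queue.popleft() for _ in range(max_merge)]
        queue.append(list(heapq.merge(*batch)))   # becomes run N+1, N+2, ...
    return list(heapq.merge(*queue))

runs = [[1, 4], [2, 9], [3, 5], [0, 8], [6, 7]]
print(merge_runs(runs, max_merge=3))   # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```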
I am not clear, however, whether this would be any faster. It may not
be worth tinkering with just for the reduction in code complexity.
But it would mean we have about 1.7x more runs that need to be merged
(for initially random data). Considering the minimum merge order is
6, that increase in runs is likely not to lead to an additional level
of merging, in which case the extra speed of building the runs would
definitely win. But if it does cause an additional level of merge, it
could end up being a loss.
That's true, but the real hit to the run size should be quite a bit
less than 1.7x, because we'd also be using memory more efficiently,
and therefore more tuples should fit. A little debugging code suggests
that on my MacBook Pro, things come out like this:
- Allocations of 8 bytes or less use 24 bytes.
- Allocations of 9-16 bytes use 32 bytes.
- Allocations of 17-32 bytes use 48 bytes.
- Allocations of 33-64 bytes use 80 bytes.
- Allocations of 65-128 bytes use 144 bytes.
- Allocations of 129-256 bytes use 272 bytes.
In other words, the palloc overhead seems to be: round up to the next
power of two, and add 16 bytes. This means that if, say, all the rows
are exactly 100 bytes, we're using 44% more memory than we would under
this scheme, which means that the runs would be 44% longer than
otherwise. Of course that's just an example - if you are sorting 132
byte rows, you'll be able to squeeze in 106% more of them, and if
you're sorting 128 byte rows, you'll only be able to squeeze in 12.5%
more of them. So there are certainly cases where the runs would get
shorter, especially if you happened to have mostly-sorted data rather
than randomly-ordered data, but not in every case. Still, it may be a
terrible idea; I only mention it because I did find a lot of
references to people using quicksort to build up the initial runs
rather than our heapsort-based approach.
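The overhead rule inferred above can be written down and checked against the table directly; a sketch under the stated assumption (round the request up to the next power of two, add a 16-byte chunk header, with an assumed 8-byte minimum allocation):

```python
def palloc_footprint(request):
    """Model of palloc's space cost as described above: round the request
    up to the next power of two, then add 16 bytes of chunk overhead."""
    size = 8                      # assumed minimum allocation size
    while size < request:
        size *= 2
    return size + 16

assert palloc_footprint(8) == 24
assert palloc_footprint(16) == 32
assert palloc_footprint(100) == 144   # 44% more than 100 bytes
assert palloc_footprint(132) == 272   # ~106% more than 132 bytes
assert palloc_footprint(128) == 144   # 12.5% more than 128 bytes
```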
Is there some broad corpus of sorting benchmarks which changes could
be tested against? I usually end up testing just simple columns of
integers or small strings, because they are easy to set up. That is
not ideal.
I don't know of anything more formal, unfortunately.
As a desirable side effect, I think it would mean
that we could dispense with retail palloc and pfree altogether. We
could just allocate a big chunk of memory, copy tuples into it until
it's full, using a pointer to keep track of the next unused byte, and
then, after writing the run, reset the allocation pointer back to the
beginning of the buffer. That would not only avoid the cost of going
through palloc/pfree, but also the memory overhead imposed by
bookkeeping and power-of-two rounding.

Wouldn't we still need an array of pointers to the start of every
tuple's location in the buffer? Or else, how would qsort know where
to find them?
Yes, we'd still need that. I don't see any way to get around that; I
just don't like the expense of palloc-ing so many little chunks in
addition to that array.
Also, to do this we would need to get around the 1GB allocation limit.
It is bad enough that memtuples is limited to 1GB, it would be much
worse if the entire arena was limited to that amount.
Well, we could always allocate multiple 1GB chunks and peel off pieces
from each one in turn.
If we do want to stick with the current algorithm, there seem to be
some newer techniques for cutting down on the heap maintenance
overhead. Heikki's been investigating that a bit.

Interesting. Is that investigation around the poor L1/L2 caching
properties of large heaps? I was wondering if there might be a way to
give tuplesort an initial estimate of how much data there was to sort,
so that it could use a smaller amount of memory than the max if it
decided that that would lead to better caching effects. Once you know
you can't do an in-memory sort, then there is no reason to use more
memory than the amount that lets you merge all the runs in one go.
No, it's about reducing the number of comparisons needed to maintain
the heap property.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
But it would mean we have about 1.7x more runs that need to be merged
(for initially random data). Considering the minimum merge order is
6, that increase in runs is likely not to lead to an additional level
of merging, in which case the extra speed of building the runs would
definitely win. But if it does cause an additional level of merge, it
could end up being a loss.

That's true, but the real hit to the run size should be quite a bit
less than 1.7x, because we'd also be using memory more efficiently,
and therefore more tuples should fit.
I'm not sure I believe the 1.7x. Keep in mind that even after
starting a new tape we can continue to read in new tuples that belong
on the first tape. So even if you have tuples that are more than N
positions out of place (where N is the size of your heap) as long as
you don't have very many you can keep writing out the first tape for
quite a while.
I suspect analyzing truly random inputs is also a bit like saying no
compression algorithm can work on average. Partly sorted data is quite
common and the tapesort algorithm would be able to do a lot of cases
in a single merge that the quicksort and merge would generate a large
number of merges for.
All that said, quicksort and merge would always do no more i/o in
cases where the total number of tuples to be sorted is significantly
less than N^2 since that would be guaranteed to be possible to process
with a single merge pass. (Where "significantly less" has something to
do with how many tuples you have to read in one i/o to be efficient).
That probably covers a lot of cases, and Postgres already has the
stats to predict when it would happen, more or less.
Fwiw when I was doing the top-k sorting I did a bunch of experiments
and came up with a rule-of-thumb that our quicksort was about twice as
fast as our heapsort. I'm not sure whether that's a big deal or not in
this case. Keep in mind that the only win would be reducing the cpu
time on a sort where every tuple was being written to disk and read
back. For most users that won't run noticeably faster, just reduce cpu
time consumption.
--
greg
On Sat, Mar 17, 2012 at 10:47 AM, Greg Stark <stark@mit.edu> wrote:
On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
But it would mean we have about 1.7x more runs that need to be merged
(for initially random data). Considering the minimum merge order is
6, that increase in runs is likely not to lead to an additional level
of merging, in which case the extra speed of building the runs would
definitely win. But if it does cause an additional level of merge, it
could end up being a loss.

That's true, but the real hit to the run size should be quite a bit
less than 1.7x, because we'd also be using memory more efficiently,
and therefore more tuples should fit.

I'm not sure I believe the 1.7x. Keep in mind that even after
starting a new tape we can continue to read in new tuples that belong
on the first tape. So even if you have tuples that are more than N
positions out of place (where N is the size of your heap) as long as
you don't have very many you can keep writing out the first tape for
quite a while.

I suspect analyzing truly random inputs is also a bit like saying no
compression algorithm can work on average.
I'm not sure what you mean here. The "also" suggests the previous
discussion was about something other than the assumption of
randomness, but that discussion doesn't make much sense unless both
paragraphs are about that.
Anyway I think keeping best/worst cases in mind is certainly a good
thing to do. I've certainly seen train wrecks unfold when people
assumed anything could be compressed without verifying it.
Partly sorted data is quite
common and the tapesort algorithm would be able to do a lot of cases
in a single merge that the quicksort and merge would generate a large
number of merges for.
This is where some common and credible corpus would be invaluable.
All that said, quicksort and merge would always do no more i/o in
cases where the total number of tuples to be sorted is significantly
less than N^2 since that would be guaranteed to be possible to process
with a single merge pass. (Where "significantly less" has something to
do with how many tuples you have to read in one i/o to be efficient).
Unless the tuples are quite large, the number needed for efficient i/o
is quite high.
That probably covers a lot of cases, and Postgres already has the
stats to predict when it would happen, more or less.
Currently sort makes no attempt to use the planner stats. Would that
be a good thing to work on?
Fwiw when I was doing the top-k sorting I did a bunch of experiements
and came up with a rule-of-thumb that our quicksort was about twice as
fast as our heapsort. I'm not sure whether that's a big deal or not in
this case.
I've been seeing around 3 fold, and that might go up more with some of
the work being done that speeds up qsort but not tape sort.
Keep in mind that the only win would be reducing the cpu
time on a sort where every tuple was being written to disk and read
back. For most users that won't run noticeably faster, just reduce cpu
time consumption.
But in many (perhaps most) tape sorts CPU time is most of it. A lot
of tape sorts are entirely memory backed. The tuples either never
reach disk, or are partially written but never read, instead being
served back from the file system's cache. By changing to a tape sort,
you are buying insurance against a bunch of other sorts also running
simultaneously, but often the insurance doesn't pay off because those
numerous other sorts don't exist and the kernel manages to keep the
few ones that do in RAM anyway.
One avenue for major surgery on the sorts would be an admission
mechanism in which jobs can negotiate over memory. Something like
allocation by halves, but with the possibility that a job could decide
it would rather wait for another sort to finish and its memory to be
freed up than make do with what is currently available.
Cheers,
Jeff
On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Anyway, I think the logtape could use redoing. When your tapes are
actually physically tape drives, it is necessary to build up runs one
after the other on physical tapes, because un-mounting a tape from a
tape drive and remounting another tape is not very feasible at scale.
Which then means you need to place your runs carefully, because if the
final merge finds that two runs it needs are back-to-back on one tape,
that is bad. But with disks pretending to be tapes, you could
re-arrange the "runs" with just some bookkeeping. Maintaining the
distinction between "tapes" and "runs" is pointless, which means the
Fibonacci placement algorithm is pointless as well.

I think you're right. It seems to me that we could just write out an
arbitrary number of runs, one per file, ending up with files numbered
1..N.
The problem there is that none of the files can be deleted until it
has been entirely read, so you end up with all the data on disk twice. I
don't know how often people run their databases so close to the edge
on disk space that this matters, but someone felt that that extra
storage was worth avoiding. We could still get rid of the run/tape
distinction but keep the block recycling method--they are conceptually
distinct things. (But hopefully improve the block recycling so the
locality doesn't degrade as much as it seems to now)
If we can do a final merge of N runs without overrunning
work_mem, fine. If not, we merge the first K runs (for the largest
possible K) and write out a new run N+1. The range of valid run
numbers is now K+1..N+1. If those can be merged in a single pass, we
do so; otherwise we again merge the first K runs (K+1 through 2K) to
create a new run N+2. And so on.

I am not clear, however, whether this would be any faster. It may not
be worth tinkering with just for the reduction in code complexity.
Yeah, I was thinking it would only be worth redoing if the current
implementation interferes with other improvements--which I think is
likely, but don't have any concrete ideas yet where it would.
...
As a desirable side effect, I think it would mean
that we could dispense with retail palloc and pfree altogether. We
could just allocate a big chunk of memory, copy tuples into it until
it's full, using a pointer to keep track of the next unused byte, and
then, after writing the run, reset the allocation pointer back to the
beginning of the buffer. That would not only avoid the cost of going
through palloc/pfree, but also the memory overhead imposed by
bookkeeping and power-of-two rounding.

Wouldn't we still need an array of pointers to the start of every
tuple's location in the buffer? Or else, how would qsort know where
to find them?

Yes, we'd still need that. I don't see any way to get around that; I
just don't like the expense of palloc-ing so many little chunks in
addition to that array.
Right, OK, I had thought you meant the power of two rounding of
mem_tuples; I overlooked the power of two rounding of palloc.
Also, to do this we would need to get around the 1GB allocation limit.
It is bad enough that memtuples is limited to 1GB, it would be much
worse if the entire arena was limited to that amount.

Well, we could always allocate multiple 1GB chunks and peel off pieces
from each one in turn.
I thought of that with mem_tuples itself, but I think it would be more
difficult to do that than just fixing the allocation limit would be.
Using multiple chunks might be easier with the buffer space as you
don't need to do heap arithmetic on those.
If we do want to stick with the current algorithm, there seem to be
some newer techniques for cutting down on the heap maintenance
overhead. Heikki's been investigating that a bit.

Interesting. Is that investigation around the poor L1/L2 caching
properties of large heaps? I was wondering if there might be a way to
give tuplesort an initial estimate of how much data there was to sort,
so that it could use a smaller amount of memory than the max if it
decided that that would lead to better caching effects. Once you know
you can't do an in-memory sort, then there is no reason to use more
memory than the amount that lets you merge all the runs in one go.

No, it's about reducing the number of comparisons needed to maintain
the heap property.
That sounds very interesting. I didn't know it was even theoretically
possible to do that.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Anyway, I think the logtape could use redoing.
The problem there is that none of the files can be deleted until it
has been entirely read, so you end up with all the data on disk twice. I
don't know how often people run their databases so close to the edge
on disk space that this matters, but someone felt that that extra
storage was worth avoiding.
Yeah, that was me, and it came out of actual user complaints ten or more
years back. (It's actually not 2X growth but more like 4X growth
according to the comments in logtape.c, though I no longer remember the
exact reasons why.) We knew when we put in the logtape logic that we
were trading off speed for space, and we accepted that. It's possible
that with the growth of hard drive sizes, real-world applications would
no longer care that much about whether the space required to sort is 4X
data size rather than 1X. Or then again, maybe their data has grown
just as fast and they still care.
As a desirable side effect, I think it would mean
that we could dispense with retail palloc and pfree altogether. We
could just allocate a big chunk of memory, copy tuples into it until
it's full, using a pointer to keep track of the next unused byte, and
then, after writing the run, reset the allocation pointer back to the
beginning of the buffer. That would not only avoid the cost of going
through palloc/pfree, but also the memory overhead imposed by
bookkeeping and power-of-two rounding.
That would be worthwhile, probably. The power-of-2 rounding in palloc
is not nice at all for tuplesort's purposes. We've occasionally talked
about inventing a different memory context type that is less willing to
waste space ... but on the other hand, such an allocator would run
slower, so you're right back to making a speed for space tradeoff.
If the tuples could be deallocated in bulk then that catch-22 goes away.
No, it's about reducing the number of comparisons needed to maintain
the heap property.
That sounds very interesting. I didn't know it was even theoretically
possible to do that.
So has somebody found a hole in the n log n lower bound on the cost of
comparison-based sorting? I thought that had been proven pretty
rigorously.
regards, tom lane
On 2012-03-18 15:25, Tom Lane wrote:
Jeff Janes<jeff.janes@gmail.com> writes:
On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas<robertmhaas@gmail.com> wrote:
The problem there is that none of the files can be deleted until it
was entirely read, so you end up with all the data on disk twice. I
don't know how often people run their databases so close to the edge
on disk space that this matters, but someone felt that that extra
storage was worth avoiding.

Yeah, that was me, and it came out of actual user complaints ten or more
years back. (It's actually not 2X growth but more like 4X growth
according to the comments in logtape.c, though I no longer remember the
exact reasons why.) We knew when we put in the logtape logic that we
were trading off speed for space, and we accepted that.
How about a runtime check of disk-free?
--
Jeremy
Jeremy Harris <jgh@wizmail.org> writes:
On 2012-03-18 15:25, Tom Lane wrote:
Yeah, that was me, and it came out of actual user complaints ten or more
years back. (It's actually not 2X growth but more like 4X growth
according to the comments in logtape.c, though I no longer remember the
exact reasons why.) We knew when we put in the logtape logic that we
were trading off speed for space, and we accepted that.
How about a runtime check of disk-free?
Not very workable, the foremost reason why not being that it assumes
that the current sort is the only thing consuming disk space. I'm not
sure there is any portable method of finding out how much disk space
is available, anyway. (Think user quotas and such before supposing
you know how to do that.)
regards, tom lane
On Sun, Mar 18, 2012 at 3:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
No, it's about reducing the number of comparisons needed to maintain
the heap property.

That sounds very interesting. I didn't know it was even theoretically
possible to do that.

So has somebody found a hole in the n log n lower bound on the cost of
comparison-based sorting? I thought that had been proven pretty
rigorously.
I think what he's describing is, for example, say you have a heap of
1,000 elements and that lets you do the entire sort in a single pass
-- i.e. it generates a single sorted run after the first pass.
Increasing the heap size to 2,000 will just mean doing an extra
comparison for each tuple to sift up. And increasing the heap size
further will just do more and more work for nothing. It's still nlogn
but the constant factor is slightly higher. Of course to optimize this
you would need to be able to see the future.
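A rough cost model makes this concrete: sifting a tuple through a binary heap costs on the order of two comparisons per level, so doubling the heap adds about one level's worth of comparisons per tuple, with no I/O benefit once the smaller heap already yields a single run. The constants here are illustrative, not measured from tuplesort.c:

```python
def sift_comparisons(heap_size, n_tuples):
    """Rough comparison count for heap-based run formation: each of the
    n_tuples sifts through a heap of depth ~log2(heap_size), doing about
    two comparisons (pick the smaller child, compare against it) per level."""
    depth = heap_size.bit_length() - 1
    return n_tuples * 2 * depth

small = sift_comparisons(1_000, 1_000_000)
large = sift_comparisons(2_000, 1_000_000)
# Doubling the heap adds one level: ~2 extra comparisons per tuple here,
# even when the smaller heap would have produced a single sorted run.
```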
I always thought the same behaviour held if the heap was larger
than necessary to do N merge passes. That is, making the heap larger
might generate fewer tapes, but still enough to require the same
number of passes. However, trying to work out an example, I'm not too
sure. If you generate fewer tapes then the heap you use to do the
merge is smaller so it might work out the same (or more likely, you
might come out ahead).
--
greg