tuplesort memory usage: grow_memtuples
When sorting small tuples, the memtuples array can use a substantial
fraction of the total per-tuple memory used. (In the case of
pass-by-value sorting, it is all of it.)
The way it grows leads to sub-optimal memory usage.
In one case, if it can grow by 1.999x, but not by 2x, we just give
up and use only half the memory we could have. This is what actually
happens in pass-by-value sorting when using a power of two for
*work_mem.
While I understand we don't want to thrash memory, it seems like we
could at least grow by less than 2x just one last time.
Also, if there is just barely room to double memtuples (and it is
pass-by-reference sorting) then the number of slots is doubled but
leaves no room for the tuples themselves to be stored, so most of
those slots can't be used.
A solution to both would be to assess when we are about to do one of
the two above things, and instead use the historical tuple size to
compute how much of a growth we can do so that the memtuples slots and
the cumulative palloc heap of tuples will exhaust at the same time.
If the availMem is less than half of allowedMem, then the next
doubling of memtuples would either fail, or would succeed but leave
behind too little allowedMem to hold the full complement of new
tuples. So either way, extrapolate from current usage. The patch
assumes that nothing other than memtuples and individual tuple storage
have been deducted from availMem. Currently that is true, but may not
always be so (we recently discussed pre-deducting the tape buffer
overhead, so it doesn't cause a memory overrun).
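To make the arithmetic concrete, here is a standalone sketch of the
extrapolation (the numbers are made up for illustration, and the
variable names only mimic tuplesort.c; this is not the patch itself):

#include <stdio.h>

int
main(void)
{
	long	allowedMem = 16L * 1024 * 1024;		/* work_mem of 16MB */
	long	availMem = 6L * 1024 * 1024;		/* not yet spent */
	int		memtupsize = 131072;				/* current slot count */
	long	memNowUsed = allowedMem - availMem;	/* slots + stored tuples */

	if (availMem < allowedMem / 2)
	{
		/*
		 * Doubling would either fail or strand memory, so instead size the
		 * array so that slots and the tuple heap run out at about the same
		 * time, based on the average memory cost per tuple seen so far.
		 */
		int		newmemtupsize = (int) ((double) memtupsize *
							   (double) allowedMem / (double) memNowUsed);

		printf("grow from %d to %d slots\n", memtupsize, newmemtupsize);
	}
	return 0;
}

With these numbers the array grows from 131072 to about 209715 slots,
rather than stopping there or jumping to 262144 with no room left for
the tuples the extra slots would point to.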
As the XXXX indicates, I am not currently able to prove to myself that
there are no conditions under which this change could cause
LACKMEM(state) to become true.
This patch gives about a 2-fold window over which a sort that would
otherwise go to tape sort will instead be done in memory, which is 2-3
times faster. That might not be very impressive, because after all if
you wanted to use more memory you could have just increased work_mem.
But it still seems worthwhile to use the memory we are told to use.
It may be hard to fine-tune the *work_mem settings, but not using the
amount we are given surely doesn't make it any easier.
But beyond the tape-versus-in-memory improvement, this also brings
other gains. For example, it gives me about a 35% speed-up when using
the default work_mem to do a "select count(distinct k) from t" where k
is a random integer and t has 512 million rows. Granted, it is silly
to intentionally do such a large sort with such small memory. But it
still seems worth improving, and the wins should show up at other
sizes as well, though probably not as big.
In my test case, the improvement mostly comes from turning random IO
reads into sequential ones.
The prevalence of random IO is due to a cascade of issues:
1) As discussed above, we are using only half the memory we could be.
2) Due to MINORDER we are spreading that reduced memory over twice as
many tapes as MERGE_BUFFER_SIZE would suggest. Changing this would
obviously have a trade-off.
3) MERGE_BUFFER_SIZE itself describes the in-RAM footprint, not the
on-tape footprint, because we unpack tuples as we read them from tape.
Perhaps mergeprereadone could stash the preread data unpacked and
unpack it only as needed. But that is a much more invasive change.
4) Pre-reading all tapes whenever just one becomes empty causes blocks
to be freed, and thus re-used, in smaller contiguous chunks than they
otherwise would. (Easy to change, but there is a trade-off to changing
it.)
5) Even without 4, logtape.c still makes a jumble of the free list on
multi-level merges. I'm not entirely sure why yet--I think maybe it
is the way indirect blocks are handled.
Add it all up, and instead of pre-reading 32 consecutive 8K blocks, it
pre-reads only about 1 or 2 consecutive ones on the final merge. Now
some of those could be salvaged by the kernel keeping track of
multiple interleaved read ahead opportunities, but in my hands vmstat
shows a lot of IO wait and shows reads that seem to be closer to
random IO than large read-ahead. If it used truly efficient read
ahead, CPU would probably be limiting.
I'll add this to the open commit fest. When doing performance
testing, I find it far easier to alter a guc.c parameter than to keep
multiple compiled binaries around. Would people like a testing patch
that does the same thing as this, or does the original behavior, under
control of a parameter setting?
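(For what that might look like, a rough sketch of a developer-only
boolean GUC follows. The parameter name and variable are hypothetical,
chosen just for illustration; a real knob could be named and placed
differently.)

/* somewhere in tuplesort.c (hypothetical) */
bool		optimize_sort_slot_growth = true;

/* entry for guc.c's ConfigureNamesBool[] (hypothetical) */
{
	{"optimize_sort_slot_growth", PGC_USERSET, DEVELOPER_OPTIONS,
		gettext_noop("Enable the last-ditch memtuples growth heuristic."),
		NULL,
		GUC_NOT_IN_SAMPLE
	},
	&optimize_sort_slot_growth,
	true,
	NULL, NULL, NULL
},

grow_memtuples() would then fall back to the plain doubling behaviour
whenever the flag is off.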
Cheers,
Jeff
Attachments:
sortmem_grow-v1.patch (application/octet-stream)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 1452e8c..9c5d85a
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 273,278 ****
--- 273,280 ----
SortTuple *memtuples; /* array of SortTuple structs */
int memtupcount; /* number of tuples currently present */
int memtupsize; /* allocated length of memtuples array */
+ bool final_memtupsize; /* true if we are no longer growing memtupsize */
+
/*
* While building initial runs, this is the current output run number
*************** tuplesort_begin_common(int workMem, bool
*** 546,551 ****
--- 548,554 ----
state->randomAccess = randomAccess;
state->bounded = false;
state->boundUsed = false;
+ state->final_memtupsize = false;
state->allowedMem = workMem * 1024L;
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
*************** tuplesort_end(Tuplesortstate *state)
*** 937,950 ****
* Grow the memtuples[] array, if possible within our memory constraint.
* Return TRUE if able to enlarge the array, FALSE if not.
*
! * At each increment we double the size of the array. When we are short
* on memory we could consider smaller increases, but because availMem
! * moves around with tuple addition/removal, this might result in thrashing.
* Small increases in the array size are likely to be pretty inefficient.
*/
static bool
grow_memtuples(Tuplesortstate *state)
{
/*
* We need to be sure that we do not cause LACKMEM to become true, else
* the space management algorithm will go nuts. We assume here that the
--- 940,957 ----
* Grow the memtuples[] array, if possible within our memory constraint.
* Return TRUE if able to enlarge the array, FALSE if not.
*
! * At each increment except possibly the last one we double the size of
! * the array. When we are short
* on memory we could consider smaller increases, but because availMem
! * moves around with tuple addition/removal, this might result in thrashing,
! * so only allow it to happen once.
* Small increases in the array size are likely to be pretty inefficient.
*/
static bool
grow_memtuples(Tuplesortstate *state)
{
+ int new_memtupsize;
+
/*
* We need to be sure that we do not cause LACKMEM to become true, else
* the space management algorithm will go nuts. We assume here that the
*************** grow_memtuples(Tuplesortstate *state)
*** 953,971 ****
* minimum array size established in tuplesort_begin_common is large
* enough to force palloc to treat it as a separate chunk, so this
* assumption should be good. But let's check it.)
*/
! if (state->availMem <= (long) (state->memtupsize * sizeof(SortTuple)))
return false;
/*
* On a 64-bit machine, allowedMem could be high enough to get us into
* trouble with MaxAllocSize, too.
*/
! if ((Size) (state->memtupsize * 2) >= MaxAllocSize / sizeof(SortTuple))
return false;
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
! state->memtupsize *= 2;
state->memtuples = (SortTuple *)
repalloc(state->memtuples,
state->memtupsize * sizeof(SortTuple));
--- 960,994 ----
* minimum array size established in tuplesort_begin_common is large
* enough to force palloc to treat it as a separate chunk, so this
* assumption should be good. But let's check it.)
+ * XXXX does the new method still follow this? The last allocation is no
+ * longer necessarily a power of 2, but that is not freed.
+ *
+ * Once we are approaching the final growth of memtuples, use the
+ * historical size of the tuples seen so far to estimate the
+ * best final growth size to make most efficient use of memory.
*/
! if (state->final_memtupsize)
return false;
+ if (state->availMem < state->allowedMem/2)
+ {
+ new_memtupsize = (int) ((float)state->memtupsize * (float) state->allowedMem / (float) (state->allowedMem - state->availMem));
+ state->final_memtupsize = true;
+ }
+ else
+ {
+ new_memtupsize = state->memtupsize * 2;
+ };
+
/*
* On a 64-bit machine, allowedMem could be high enough to get us into
* trouble with MaxAllocSize, too.
*/
! if ((Size) (new_memtupsize) >= MaxAllocSize / sizeof(SortTuple))
return false;
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
! state->memtupsize = new_memtupsize;
state->memtuples = (SortTuple *)
repalloc(state->memtuples,
state->memtupsize * sizeof(SortTuple));
On Sat, Mar 3, 2012 at 3:22 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
When sorting small tuples, the memtuples array can use a substantial
fraction of the total per-tuple memory used. (In the case of pass by
value, it is all of it.)
The way it grows leads to sub-optimal memory usage.
Greg, I see you signed up to review this on the 14th, but I don't see
a review yet. Are you planning to post one soon?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3 March 2012 20:22, Jeff Janes <jeff.janes@gmail.com> wrote:
Add it all up, and instead of pre-reading 32 consecutive 8K blocks, it
pre-reads only about 1 or 2 consecutive ones on the final merge. Now
some of those could be salvaged by the kernel keeping track of
multiple interleaved read ahead opportunities, but in my hands vmstat
shows a lot of IO wait and shows reads that seem to be closer to
random IO than large read-ahead. If it used truly efficient read
ahead, CPU would probably be limiting.
Can you suggest a benchmark that will usefully exercise this patch?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Jul 25, 2012 at 2:51 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 3 March 2012 20:22, Jeff Janes <jeff.janes@gmail.com> wrote:
Add it all up, and instead of pre-reading 32 consecutive 8K blocks, it
pre-reads only about 1 or 2 consecutive ones on the final merge. Now
some of those could be salvaged by the kernel keeping track of
multiple interleaved read ahead opportunities, but in my hands vmstat
shows a lot of IO wait and shows reads that seem to be closer to
random IO than large read-ahead. If it used truly efficient read
ahead, CPU would probably be limiting.
Can you suggest a benchmark that will usefully exercise this patch?
I think the given sizes below work on most 64 bit machines.
unpatched:
jeff=# set work_mem=16384;
jeff=# select count(distinct foo) from (select random() as foo from
generate_series(1,524200)) asdf;
Time: 498.944 ms
jeff=# select count(distinct foo) from (select random() as foo from
generate_series(1,524300)) asdf;
Time: 909.125 ms
patched:
jeff=# set work_mem=16384;
jeff=# select count(distinct foo) from (select random() as foo from
generate_series(1,524200)) asdf;
Time: 493.208 ms
jeff=# select count(distinct foo) from (select random() as foo from
generate_series(1,524300)) asdf;
Time: 497.035 ms
If you want to get a picture of what is going on internally, you can set:
set client_min_messages =log;
set trace_sort = on;
(Although trace_sort isn't all that informative as it currently
exists, it does at least let you see the transition from internal to
external.)
Cheers,
Jeff
On 27 July 2012 16:39, Jeff Janes <jeff.janes@gmail.com> wrote:
Can you suggest a benchmark that will usefully exercise this patch?
I think the given sizes below work on most 64 bit machines.
My apologies for not getting around to taking a look at this sooner.
I tested this patch on my x86_64 laptop:
[peter@peterlaptop tests]$ uname -r
3.4.4-4.fc16.x86_64
I have been able to recreate your results with the work_mem setting
you supplied, 16384 (both queries that you suggested are executed
together, for a less sympathetic workload):
[peter@peterlaptop tests]$ cat sortmemgrow_sort.sql
select count(distinct foo) from (select random() as foo from
generate_series(1,524200)) asdf;
select count(distinct foo) from (select random() as foo from
generate_series(1,524300)) asdf;
[peter@peterlaptop tests]$ # For master:
[peter@peterlaptop tests]$ pgbench -f sortmemgrow_sort.sql -T 45 -n
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 45
tps = 0.992526 (including connections establishing)
tps = 0.992633 (excluding connections establishing)
[peter@peterlaptop tests]$ # For patch:
[peter@peterlaptop tests]$ pgbench -f sortmemgrow_sort.sql -T 45 -n
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 66
tps = 1.461739 (including connections establishing)
tps = 1.461911 (excluding connections establishing)
I didn't find trace_sort all that useful for my earlier work on
optimising in-memory sorting (at least when running benchmarks), as
the granularity is too small for simple, relatively inexpensive
queries with sort nodes (or some tuplesort usage, like the datum sort
of your example). Also, the instrumentation can have a Heisenberg
effect. I've avoided using it here. The less expensive query's
executions costs are essentially the same with and without the patch.
This patch works by not doubling state->memtupsize when available
memory is less than half of allowed memory (i.e. we are one call to
grow_memtuples() away from reporting failure). Rather, in that
situation, it is set to a size that extrapolates the likely size of
the total amount of memory needed in a way that is quite flawed, but
likely to work well for the pass-by-value Datum case. Now, on the face
of it, this may appear to be a re-hashing of the question of "how
paranoid do we need to be about wasting memory in memory-constrained
situations - should we consider anything other than a geometric growth
rate, to be parsimonious with memory at the risk of thrashing?".
However, this is not really the case, because this is only a single,
last-ditch effort to avoid a particularly undesirable eventuality. We
have little to lose and quite a bit to gain. A cheap guestimation
seems reasonable.
I have attached a revision for your consideration, with a few
editorialisations, mostly style-related.
I removed this entirely:
+ * XXXX does the new method still follow this? The last allocation is no
+ * longer necessarily a power of 2, but that is not freed.
I did so because, according to aset.c:
* Further improvement 12/00: as the code stood, request sizes in the
* midrange between "small" and "large" were handled very inefficiently,
* because any sufficiently large free chunk would be used to satisfy a
* request, even if it was much larger than necessary. This led to more
* and more wasted space in allocated chunks over time. To fix, get rid
* of the midrange behavior: we now handle only "small" power-of-2-size
* chunks as chunks. Anything "large" is passed off to malloc(). Change
* the number of freelists to change the small/large boundary.
So, at the top of grow_memtuples, this remark still holds:
* and so there will be no unexpected addition to what we ask for. (The
* minimum array size established in tuplesort_begin_common is large
* enough to force palloc to treat it as a separate chunk, so this
* assumption should be good. But let's check it.)
It continues to hold because we're still invariably passing off this
request to malloc() (or, in this particular case, realloc())
regardless of the alignment of the request size. Certainly, the extant
code verifies !LACKMEM, and the regression tests pass when this patch
is applied.
However, there is still a danger of LACKMEM. This revision has the
following logic for extrapolating newmemtupsize (This is almost the
same as the original patch):
+ newmemtupsize = (int) floor(oldmemtupsize * allowedMem / memNowUsed);
Suppose that the code was:
+ newmemtupsize = (int) ceil(oldmemtupsize * allowedMem / memNowUsed);
We'd now have an error with your latter example query because
LACKMEM(state) would become true, due to the inclusion of a single
additional SortTuple. This is because GetMemoryChunkSpace (and,
ultimately, AllocSetGetChunkSpace) return the chunk header size too.
We were getting away with that before with the doubling strategy, but
now we have to factor that header size into allowedMem.
I don't think my revision is satisfactory in what it does here. It isn't
obvious what a better principled implementation would do though. Does
anyone have any other ideas?
I think this patch (or at least your observation about I/O waits
within vmstat) may point to a more fundamental issue with our sort
code: Why are we not using asynchronous I/O in our implementation?
There are anecdotal reports of other RDBMS implementations doing far
better than we do here, and I believe asynchronous I/O, pipelining,
and other such optimisations have a lot to do with that. It's
something I'd hoped to find the time to look at in detail, but
probably won't in the 9.3 cycle. One of the more obvious ways of
optimising an external sort is to use asynchronous I/O so that one run
of data can be sorted or merged while other runs are being read from
or written to disk. Our current implementation seems naive about this.
There are some interesting details about how this is exposed by POSIX
here:
http://www.gnu.org/software/libc/manual/html_node/Asynchronous-I_002fO.html
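(For reference, a minimal self-contained sketch of that interface,
aio_read() to start a read and aio_suspend() to wait for it, is below.
It is only meant to show the shape of the API, not proposed tuplesort
code; link with -lrt on Linux.)

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char		buf[8192];
	struct aiocb cb;
	const struct aiocb *list[1];
	int			fd = open("/etc/hostname", O_RDONLY);

	if (fd < 0)
		return 1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = sizeof(buf);
	cb.aio_offset = 0;

	if (aio_read(&cb) != 0)		/* kick off the read */
		return 1;

	/* ... a merge could work on already-loaded runs here ... */

	list[0] = &cb;
	aio_suspend(list, 1, NULL);	/* block until the read completes */
	printf("read %zd bytes\n", aio_return(&cb));

	close(fd);
	return 0;
}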
It's already anticipated that we might take advantage of libaio for
the benefit of FilePrefetch() (see its accompanying comments - it uses
posix_fadvise itself - effective_io_concurrency must be > 0 for this
to ever be called). It perhaps could be considered parallel
"low-hanging fruit" in that it allows us to offer limited though
useful backend parallelism without first resolving thorny issues
around what abstraction we might use, or how we might eventually make
backends thread-safe. AIO supports registering signal callbacks (a
SIGPOLL handler can be called), which seems relatively
uncontroversial.
Platform support for AIO might be a bit lacking, but then you can say
the same about posix_fadvise. We don't assume that poll(2) is
available, but we already use it where it is within the latch code.
Besides, in-kernel support can be emulated if POSIX threads is
available, which I believe would make this broadly useful on unix-like
platforms.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
sortmem_grow-v2.patch (application/octet-stream)
diff src/backend/utils/sort/tuplesort.c
index d5a2003..27d833f
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 275,280 ****
--- 275,281 ----
SortTuple *memtuples; /* array of SortTuple structs */
int memtupcount; /* number of tuples currently present */
int memtupsize; /* allocated length of memtuples array */
+ bool fin_growth; /* are we no longer growing memtupsize? */
/*
* While building initial runs, this is the current output run number
*************** tuplesort_begin_common(int workMem, bool
*** 569,574 ****
--- 570,576 ----
state->memtupcount = 0;
state->memtupsize = 1024; /* initial guess */
+ state->fin_growth = false;
state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));
*************** tuplesort_end(Tuplesortstate *state)
*** 956,965 ****
* Grow the memtuples[] array, if possible within our memory constraint.
* Return TRUE if able to enlarge the array, FALSE if not.
*
! * At each increment we double the size of the array. When we are short
! * on memory we could consider smaller increases, but because availMem
! * moves around with tuple addition/removal, this might result in thrashing.
! * Small increases in the array size are likely to be pretty inefficient.
*/
static bool
grow_memtuples(Tuplesortstate *state)
--- 958,968 ----
* Grow the memtuples[] array, if possible within our memory constraint.
* Return TRUE if able to enlarge the array, FALSE if not.
*
! * At each increment (except, possibly, the last) we double the size of the
! * array. When we are short on memory we consider smaller increases, but
! * because availMem moves around with tuple addition/removal, this might result
! * in thrashing, which is why it will only occur once at the most. Small
! * increases in the array size are likely to be pretty inefficient.
*/
static bool
grow_memtuples(Tuplesortstate *state)
*************** grow_memtuples(Tuplesortstate *state)
*** 973,990 ****
* enough to force palloc to treat it as a separate chunk, so this
* assumption should be good. But let's check it.)
*/
! if (state->availMem <= (long) (state->memtupsize * sizeof(SortTuple)))
return false;
/*
* On a 64-bit machine, allowedMem could be high enough to get us into
* trouble with MaxAllocSize, too.
*/
! if ((Size) (state->memtupsize * 2) >= MaxAllocSize / sizeof(SortTuple))
return false;
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
! state->memtupsize *= 2;
state->memtuples = (SortTuple *)
repalloc(state->memtuples,
state->memtupsize * sizeof(SortTuple));
--- 976,1028 ----
* enough to force palloc to treat it as a separate chunk, so this
* assumption should be good. But let's check it.)
*/
! int newmemtupsize;
!
! if (state->fin_growth)
! {
return false;
+ }
+ else if (state->availMem < state->allowedMem / 2)
+ {
+ /*
+ * Make a last-ditch effort to avoid exceeding allowedMem - abandon
+ * doubling strategy.
+ *
+ * Once we are approaching the final growth of memtuples, use the known
+ * size of the tuples seen so far to estimate final growth size. This
+ * should be on average more space-efficient, which is particularly
+ * important here.
+ *
+ * N.B. We rely on the assumption that nothing other than memtuples and
+ * individual tuple storage has been deducted from availMem.
+ */
+ float oldmemtupsize = (float) state->memtupsize;
+ float allowedMem = (float) state->allowedMem;
+ float memNowUsed = (float) state->allowedMem - state->availMem;
+
+ Assert(memNowUsed > 0);
+
+ /*
+ * XXX: This feels quite brittle; is there a better principled approach,
+ * that does not violate modularity?
+ */
+ newmemtupsize = (int) floor(oldmemtupsize * allowedMem / memNowUsed);
+ state->fin_growth = true;
+ }
+ else
+ {
+ newmemtupsize = state->memtupsize * 2;
+ }
/*
* On a 64-bit machine, allowedMem could be high enough to get us into
* trouble with MaxAllocSize, too.
*/
! if ((Size) (newmemtupsize) >= MaxAllocSize / sizeof(SortTuple))
return false;
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
! state->memtupsize = newmemtupsize;
state->memtuples = (SortTuple *)
repalloc(state->memtuples,
state->memtupsize * sizeof(SortTuple));
Do you intend to follow through with this, Jeff?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Aug 16, 2012 at 4:26 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 27 July 2012 16:39, Jeff Janes <jeff.janes@gmail.com> wrote:
Can you suggest a benchmark that will usefully exercise this patch?
I think the given sizes below work on most 64 bit machines.
My apologies for not getting around to taking a look at this sooner.
...
I have attached a revision for your consideration, with a few
editorialisations, mostly style-related.
Sorry, this fell through the cracks. Your proposed patch looks good.
...
I think this patch (or at least your observation about I/O waits
within vmstat) may point to a more fundamental issue with our sort
code: Why are we not using asynchronous I/O in our implementation?
I only see a lot of io waits when using a small value of work_mem.
And I don't know how useful async would be there, as where would we
stash the read-ahead data with work_mem being so small? At larger
sizes, the kernel or raid controller automatic read ahead seems to be
enough to saturate a CPU.
Maybe just giving more aggressive advice about the default value of
work_mem being quite low for modern hardware, or making it scale with
shared_buffers by default similar to the way wal_buffers now does,
would be sufficient.
Cheers,
Jeff
On 17 August 2012 00:26, Peter Geoghegan <peter@2ndquadrant.com> wrote:
This patch works by not doubling state->memtupsize when available
memory is less than half of allowed memory (i.e. we are one call to
grow_memtuples() away from reporting failure). Rather, in that
situation, it is set to a size that extrapolates the likely size of
the total amount of memory needed in a way that is quite flawed, but
likely to work well for the pass-by-value Datum case. Now, on the face
of it, this may appear to be a re-hashing of the question of "how
paranoid do we need to be about wasting memory in memory-constrained
situations - should we consider anything other than a geometric growth
rate, to be parsimonious with memory at the risk of thrashing?".
However, this is not really the case, because this is only a single,
last-ditch effort to avoid a particularly undesirable eventuality. We
have little to lose and quite a bit to gain. A cheap guestimation
seems reasonable.
This is a very useful optimisation, for both the low and the high end.
The current coding allows full use of memory iff the memory usage is
an exact power of 2, so this patch will allow an average of 50% and as
much as 100% gain in memory for sorts. We need to be careful to note
this as a change on the release notes, since users may well have set
work_mem to x2 what was actually needed to get it to work - that means
most users will see a sudden leap in actual memory usage, which could
lead to swapping in edge cases.
I notice that cost_sort doesn't take into account the space used for
sorting other than tuple space so apparently requires no changes with
this patch. ISTM that cost_sort should be less optimistic about memory
efficiency than it is.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 14 October 2012 09:19, Simon Riggs <simon@2ndquadrant.com> wrote:
This is a very useful optimisation, for both the low and the high end.
I suppose it's possible to not think of it as an optimisation at all.
Rather, it could be considered a way of making tuplesort really use
the work_mem allotted to it, or at least use it more effectively, so
that DBAs don't have to oversize work_mem. You're getting the same
behaviour as when you oversized work_mem, except that you don't run
the risk of thrashing due to excessive paging if ever there is a sort
big enough to have that effect.
The current coding allows full use of memory iff the memory usage is
an exact power of 2, so this patch will allow an average of 50% and as
much as 100% gain in memory for sorts. We need to be careful to note
this as a change on the release notes, since users may well have set
work_mem to x2 what was actually needed to get it to work - that means
most users will see a sudden leap in actual memory usage, which could
lead to swapping in edge cases.
Yes, that's probably true.
I notice that cost_sort doesn't take into account the space used for
sorting other than tuple space so apparently requires no changes with
this patch. ISTM that cost_sort should be less optimistic about memory
efficiency than it is.
Perhaps. I don't have an intuitive sense of what is and is not worth
modelling in the optimiser, so I can't really comment here.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 14 October 2012 09:19, Simon Riggs <simon@2ndquadrant.com> wrote:
This is a very useful optimisation, for both the low and the high end.
Well, I'm about ready to mark this one "ready for committer". There is
this outstanding issue in my revision of August 17th, though:
+ /*
+ * XXX: This feels quite brittle; is there a better principled approach,
+ * that does not violate modularity?
+ */
+ newmemtupsize = (int) floor(oldmemtupsize * allowedMem / memNowUsed);
+ state->fin_growth = true;
I suppose that I should just recognise that this *is* nothing more
than a heuristic, and leave it at that.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 16 October 2012 13:42, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 14 October 2012 09:19, Simon Riggs <simon@2ndquadrant.com> wrote:
This is a very useful optimisation, for both the low and the high end.
Well, I'm about ready to mark this one "ready for committer". There is
this outstanding issue in my revision of August 17th, though:
+ /*
+  * XXX: This feels quite brittle; is there a better principled approach,
+  * that does not violate modularity?
+  */
+ newmemtupsize = (int) floor(oldmemtupsize * allowedMem / memNowUsed);
+ state->fin_growth = true;
I suppose that I should just recognise that this *is* nothing more
than a heuristic, and leave it at that.
It's a simple and reasonable heuristic, and a great improvement on the
previous situation.
If you describe in detail that it is a heuristic and why that is
proposed over other approaches that should be sufficient for future
generations to read and understand.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 16 October 2012 14:24, Simon Riggs <simon@2ndquadrant.com> wrote:
If you describe in detail that it is a heuristic and why that is
proposed over other approaches that should be sufficient for future
generations to read and understand.
I've done so, in the attached revision. Things have been simplified
somewhat too.
The same basic strategy for sizing the tuplesort memtuples array
also exists in tuplestore. I wonder if we should repeat this there? I
suppose that that could follow later.
The patch has now been marked "ready for committer". Does this need
doc changes, in light of what is arguably a behavioural difference?
You only mentioned release notes.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
sortmem_grow-v3.patch (application/octet-stream)
diff src/backend/utils/sort/tuplesort.c
index ce27e40..80ac11b
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** tuplesort_end(Tuplesortstate *state)
*** 957,970 ****
* Grow the memtuples[] array, if possible within our memory constraint.
* Return TRUE if able to enlarge the array, FALSE if not.
*
! * At each increment we double the size of the array. When we are short
! * on memory we could consider smaller increases, but because availMem
! * moves around with tuple addition/removal, this might result in thrashing.
! * Small increases in the array size are likely to be pretty inefficient.
*/
static bool
grow_memtuples(Tuplesortstate *state)
{
/*
* We need to be sure that we do not cause LACKMEM to become true, else
* the space management algorithm will go nuts. We assume here that the
--- 957,974 ----
* Grow the memtuples[] array, if possible within our memory constraint.
* Return TRUE if able to enlarge the array, FALSE if not.
*
! * At each increment we double the size of the array. When we are short on
! * memory we do attempt one last, smaller increase. This only happens at most
! * once, since availMem moves around with tuple addition/removal. To do otherwise
! * might result in thrashing. This is nothing more than a last-ditch effort to
! * avoid exceeding allowedMem, an undesirable outcome if avoidable.
*/
static bool
grow_memtuples(Tuplesortstate *state)
{
+ int newmemtupsize;
+ long memNowUsed = state->allowedMem - state->availMem;
+
/*
* We need to be sure that we do not cause LACKMEM to become true, else
* the space management algorithm will go nuts. We assume here that the
*************** grow_memtuples(Tuplesortstate *state)
*** 974,991 ****
* enough to force palloc to treat it as a separate chunk, so this
* assumption should be good. But let's check it.)
*/
! if (state->availMem <= (long) (state->memtupsize * sizeof(SortTuple)))
! return false;
/*
* On a 64-bit machine, allowedMem could be high enough to get us into
* trouble with MaxAllocSize, too.
*/
! if ((Size) (state->memtupsize * 2) >= MaxAllocSize / sizeof(SortTuple))
return false;
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
! state->memtupsize *= 2;
state->memtuples = (SortTuple *)
repalloc(state->memtuples,
state->memtupsize * sizeof(SortTuple));
--- 978,1020 ----
* enough to force palloc to treat it as a separate chunk, so this
* assumption should be good. But let's check it.)
*/
! if (memNowUsed > state->availMem)
! {
! int memtupsize = state->memtupsize;
! long allowedMem = state->allowedMem;
!
! /*
! * For this last increment, abandon doubling strategy.
! *
! * Use the known size (in bytes) of the tuples seen so far to estimate a
! * memtuples size that is just within our constraint. This is nothing
! * more than a heuristic.
! *
! * N.B. We rely on the assumption that nothing other than memtuples and
! * individual tuple storage has been deducted from availMem.
! */
! newmemtupsize = memtupsize * allowedMem / memNowUsed;
!
! Assert(newmemtupsize <= state->memtupsize * 2);
!
! /* This may not be our first time through */
! if (newmemtupsize <= memtupsize)
! return false;
! }
! else
! {
! newmemtupsize = state->memtupsize * 2;
! }
/*
* On a 64-bit machine, allowedMem could be high enough to get us into
* trouble with MaxAllocSize, too.
*/
! if ((Size) (newmemtupsize) >= MaxAllocSize / sizeof(SortTuple))
return false;
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
! state->memtupsize = newmemtupsize;
state->memtuples = (SortTuple *)
repalloc(state->memtuples,
state->memtupsize * sizeof(SortTuple));
On Tue, Oct 16, 2012 at 9:47 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
The patch has now been marked "ready for committer". Does this need
doc changes, in light of what is arguably a behavioural difference?
You only mentioned release notes.
I'm happy to look at this one, probably next week at pgconf.eu. It seems like a
reasonable size patch to get back into things.
That's assuming my committer bits haven't lapsed and people are ok
with me stepping back into things?
--
greg
On 16 October 2012 22:18, Greg Stark <stark@mit.edu> wrote:
That's assuming my committer bits haven't lapsed and people are ok
with me stepping back into things?
I personally have no objections.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Greg Stark wrote:
On Tue, Oct 16, 2012 at 9:47 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
The patch has now been marked "ready for committer". Does this need
doc changes, in light of what is arguably a behavioural difference?
You only mentioned release notes.
I'm happy to look at this one, probably next week at pgconf.eu. It seems like a
reasonable size patch to get back into things.
That's assuming my committer bits haven't lapsed and people are ok
with me stepping back into things?
Feel free. Your committer bits should still work.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 16 October 2012 21:47, Peter Geoghegan <peter@2ndquadrant.com> wrote:
The same basic strategy for sizing the tuplesort memtuples array
also exists in tuplestore. I wonder if we should repeat this there? I
suppose that that could follow later.
Incidentally, the basis of this remark is commit 2689abf0, where Tom
decided to keep the two in sync. That's a precedent for what we need
to do here, I suppose.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Oct 16, 2012 at 4:47 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
The same basic strategy for sizing the tuplesort memtuples array
also exists in tuplestore. I wonder if we should repeat this there? I
suppose that that could follow later.
I think it'd be a good idea to either adjust tuplestore as well or
explain in the comments there why the algorithm is different, because
it has a comment which references this comment, and now the logic
won't be in sync in spite of a comment implying that it is.
I didn't initially understand why the proposed heuristic proposed is
safe. It's clear that we have to keep LACKMEM() from becoming true,
or we'll bail out of this code with elog(ERROR). It will become true
if we increase the size of the array more by a number of elements
which exceeds state->availmem / sizeof(SortTuple), but the actual
number of elements being added is memtupsize * allowedMem / memNowUsed
- memtupsize, which doesn't look closely related. However, I think I
figured it out. We know that memNowUsed >= memtupsize *
sizeof(SortTuple); substituting, we're adding no more than memtupsize
* allowedMem / (memtupsize * sizeof(SortTuple)) - memtupsize elements,
which simplifies to allowedMem / sizeof(SortTuple) - memtupsize =
(allowedMem - sizeof(SortTuple) * memtupsize) / sizeof(SortTuple).
Since we know availMem < allowedMem - sizeof(SortTuple) * memtupsize,
that means the number of new elements is less than
state->availMem/sizeof(SortTuple). Whew. In the attached version, I
updated the comment to reflect this, and also reversed the order of
the if/else block to put the shorter and more common case first, which
I think is better style.
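(For the archives, the same bound can be written a bit more compactly
in the patch's variable names; this is just a restatement of the
argument above, not new code:

   added slots = newmemtupsize - memtupsize
               = memtupsize * (allowedMem - memNowUsed) / memNowUsed
               = memtupsize * availMem / memNowUsed
              <= availMem / sizeof(SortTuple)

since memNowUsed >= memtupsize * sizeof(SortTuple), so the new slots
always fit within what availMem can still pay for.)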
I'm still not too sure why this part is OK:
/* This may not be our first time through */
if (newmemtupsize <= memtupsize)
return false;
Suppose we enlarge the memtuples array by something other than a
multiple of 2, and then we kick out all of the tuples that are
currently in memory and load a new group with a smaller average size.
ISTM that it could now appear that the memtuples array can be useful
further enlarged, perhaps by as little as one tuple, and that that
this could happen repeatedly. I think we should either refuse to grow
the array unless we can increase it by at least, say, 10%, or else set
a flag after the final enlargement that acts as a hard stop to any
further growth.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
sortmem_grow-v4.patch (application/octet-stream)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index ce27e40..a97e3b1 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -957,14 +957,18 @@ tuplesort_end(Tuplesortstate *state)
* Grow the memtuples[] array, if possible within our memory constraint.
* Return TRUE if able to enlarge the array, FALSE if not.
*
- * At each increment we double the size of the array. When we are short
- * on memory we could consider smaller increases, but because availMem
- * moves around with tuple addition/removal, this might result in thrashing.
- * Small increases in the array size are likely to be pretty inefficient.
+ * At each increment we double the size of the array. When we are short on
+ * memory we do attempt one last, smaller increase. This only happens at most
+ * once, since availMem moves around with tuple addition/removal. To do otherwise
+ * might result in thrashing. This is nothing more than a last-ditch effort to
+ * avoid exceeding allowedMem, an undesirable outcome if avoidable.
*/
static bool
grow_memtuples(Tuplesortstate *state)
{
+ int newmemtupsize;
+ long memNowUsed = state->allowedMem - state->availMem;
+
/*
* We need to be sure that we do not cause LACKMEM to become true, else
* the space management algorithm will go nuts. We assume here that the
@@ -974,18 +978,50 @@ grow_memtuples(Tuplesortstate *state)
* enough to force palloc to treat it as a separate chunk, so this
* assumption should be good. But let's check it.)
*/
- if (state->availMem <= (long) (state->memtupsize * sizeof(SortTuple)))
- return false;
+ if (memNowUsed <= state->availMem)
+ newmemtupsize = state->memtupsize * 2;
+ else
+ {
+ int memtupsize = state->memtupsize;
+ long allowedMem = state->allowedMem;
+
+ /*
+ * For this last increment, abandon doubling strategy.
+ *
+ * To make sure LACKMEM(state) doesn't become true, we can't increase
+ * memtupsize by more than state->availMem/sizeof(SortTuple) elements.
+ * In practice, we want to increase it by considerably less, because
+ * we need to leave some space for the tuples to which the new array
+ * slots will refer. We assume the new tuples will be about the same
+ * size as the tuples we've already seen, and thus use the known size
+ * (in bytes) of the tuples seen so far to estimate an appropriate new
+ * size for the memtuples array. The optimal value might be higher or
+ * lower than we estimate, but it's hard to know that in advance.
+ *
+ * In any case, we're definitely safe against enlarging the array so
+ * much that LACKMEM(state) becomes true, because the memory currently
+ * used includes the present array; thus, there would be enough
+ * allowedMem for the new array elements even if no other memory were
+ * currently used.
+ */
+ newmemtupsize = memtupsize * allowedMem / memNowUsed;
+
+ Assert(newmemtupsize <= state->memtupsize * 2);
+
+ /* This may not be our first time through */
+ if (newmemtupsize <= memtupsize)
+ return false;
+ }
/*
* On a 64-bit machine, allowedMem could be high enough to get us into
* trouble with MaxAllocSize, too.
*/
- if ((Size) (state->memtupsize * 2) >= MaxAllocSize / sizeof(SortTuple))
+ if ((Size) (newmemtupsize) >= MaxAllocSize / sizeof(SortTuple))
return false;
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
- state->memtupsize *= 2;
+ state->memtupsize = newmemtupsize;
state->memtuples = (SortTuple *)
repalloc(state->memtuples,
state->memtupsize * sizeof(SortTuple));
On 15 November 2012 16:09, Robert Haas <robertmhaas@gmail.com> wrote:
[Lots of complicated commentary on tuplesort variables]
Whew. In the attached version, I
updated the comment to reflect this, and also reversed the order of
the if/else block to put the shorter and more common case first, which
I think is better style.
Fine by me.
In conversation with Greg Stark in Prague, he seemed to think that
there was an integer overflow hazard in v3, which is, in terms of the
machine code it will produce, identical to your revision. He pointed
this out to me when we were moving between sessions, and I only
briefly looked over his shoulder, so while I don't want to
misrepresent what he said, I *think* he said that this could overflow:
+ newmemtupsize = memtupsize * allowedMem / memNowUsed;
Having only looked at this properly now, I don't think that that's
actually the case, but I thought that it should be noted. I'm sorry if
it's unfair to Greg to correct him, even though that's something he
didn't formally put on the record, and may have only looked at
briefly. I can see why it might have looked like a concern, since
we're assigning the result of an arithmetic expression involving long
variables to an int, newmemtupsize. The fact of the matter is that
we're already asserting that:
+ Assert(newmemtupsize <= state->memtupsize * 2);
Previously, we'd just have doubled memtupsize anyway. So any
eventuality in which newmemtupsize overflows ought to be logically
impossible.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 15 November 2012 16:09, Robert Haas <robertmhaas@gmail.com> wrote:
I'm still not too sure why this part is OK:
/* This may not be our first time through */
if (newmemtupsize <= memtupsize)
return false;
Suppose we enlarge the memtuples array by something other than a
multiple of 2, and then we kick out all of the tuples that are
currently in memory and load a new group with a smaller average size.
ISTM that it could now appear that the memtuples array can be useful
further enlarged, perhaps by as little as one tuple, and that that
this could happen repeatedly. I think we should either refuse to grow
the array unless we can increase it by at least, say, 10%, or else set
a flag after the final enlargement that acts as a hard stop to any
further growth.
I thought that the flag added to Tuplesortstate in earlier revisions
was a little bit ugly. I can see the concern though, and I suppose
that given the alternative is to add a heuristic, simply using a flag
may be the best way to proceed.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Nov 15, 2012 at 12:14 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 15 November 2012 16:09, Robert Haas <robertmhaas@gmail.com> wrote:
[Lots of complicated commentary on tuplesort variables]
Whew. In the attached version, I
updated the comment to reflect this, and also reversed the order of
the if/else block to put the shorter and more common case first, which
I think is better style.
Fine by me.
In conversation with Greg Stark in Prague, he seemed to think that
there was an integer overflow hazard in v3, which is, in terms of the
machine code it will produce, identical to your revision. He pointed
this out to me when we were moving between sessions, and I only
briefly looked over his shoulder, so while I don't want to
misrepresent what he said, I *think* he said that this could overflow:
+ newmemtupsize = memtupsize * allowedMem / memNowUsed;
Ah, yeah. I wondered in passing about that but forgot to follow up on
it. The problem specifically is that the intermediate result
memtupsize * newmemtuples might overflow. I believe that the old
memtupsize can never be more than 2^26 entries, because the allocation
limit is 1GB and each SortTuple is 16 bytes. So if this is done as a
32-bit calculation, we'll potentially overflow if allowedMem is more
than 64 bytes i.e. almost for sure. If it is done as a 64-bit
calculation we'll potentially overflow if allowedMem is more than 2^38
bytes = 256GB. The actual declared type is "long" which I assume is
probably 64 bits at least for most people these days, but maybe not
for everyone, and even 256GB is not necessarily safe. The highest
value for work_mem I can set here is 2047GB. So I think there is an
issue here, unless there's something wrong with my analysis.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 15 November 2012 18:13, Robert Haas <robertmhaas@gmail.com> wrote:
Ah, yeah. I wondered in passing about that but forgot to follow up on
it. The problem specifically is that the intermediate result
memtupsize * newmemtuples might overflow. I believe that the old
memtupsize can never be more than 2^26 bytes, because the allocation
limit is 1GB and each SortTuple is 16 bytes.
Do you mean the intermediate result of memtupsize * allowedMem? Oh,
yeah, that could overflow rather easily on a platform where long is
only 32-bit. We're multiplying the entire current allocation size of
the array by the maximum length. I guess the fact that you didn't spot
it made me overconfident. :-)
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Nov 15, 2012 at 2:13 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 15 November 2012 18:13, Robert Haas <robertmhaas@gmail.com> wrote:
Ah, yeah. I wondered in passing about that but forgot to follow up on
it. The problem specifically is that the intermediate result
memtupsize * newmemtuples might overflow. I believe that the old
memtupsize can never be more than 2^26 entries, because the allocation
limit is 1GB and each SortTuple is 16 bytes.
Do you mean the intermediate result of memtupsize * allowedMem?
Yes.
Oh,
yeah, that could overflow rather easily on a platform where long is
only 32-bit.
Or even 64-bit, with really large values of work_mem.
We're multiplying the entire current allocation size of
the array by the maximum length. I guess the fact that you didn't spot
it made me overconfident. :-)
Ha!
So what's next here? Do you want to work on these issue some more?
Or does Jeff? I'd like to see this go in, but I'm not sure I have the
bandwidth to do the legwork myself.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 15 November 2012 19:16, Robert Haas <robertmhaas@gmail.com> wrote:
So what's next here? Do you want to work on these issue some more?
Or does Jeff? I'd like to see this go in, but I'm not sure I have the
bandwidth to do the legwork myself.
I'll take another look. No elegant solution immediately occurs to me, though.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Nov 15, 2012 at 7:36 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 15 November 2012 19:16, Robert Haas <robertmhaas@gmail.com> wrote:
So what's next here? Do you want to work on these issue some more?
Or does Jeff? I'd like to see this go in, but I'm not sure I have the
bandwidth to do the legwork myself.
I'll take another look. No elegant solution immediately occurs to me, though.
The overflow was trivial to fix.
The only concern I had was about the behaviour after it did the
special case. I didn't want it to keep doing the math and trying to
grow again a little bit every tuple. I think I was leaning to putting
the magic flag back. The alternative is to only do the one-off grow if
we can grow at least some arbitrary percentage like 10% or something
like that. But it seems like a lot of arithmetic to be doing each time
around for probably no gain.
--
greg
On 15 November 2012 19:54, Greg Stark <stark@mit.edu> wrote:
The only concern I had was about the behaviour after it did the
special case. I didn't want it to keep doing the math and trying to
grow again a little bit every tuple. I think I was leaning to putting
the magic flag back.
The alternative might just be to add a new constant to the
TupSortStatus enum. That might be more logical.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Nov 15, 2012 at 2:54 PM, Greg Stark <stark@mit.edu> wrote:
On Thu, Nov 15, 2012 at 7:36 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 15 November 2012 19:16, Robert Haas <robertmhaas@gmail.com> wrote:
So what's next here? Do you want to work on these issue some more?
Or does Jeff? I'd like to see this go in, but I'm not sure I have the
bandwidth to do the legwork myself.
I'll take another look. No elegant solution immediately occurs to me, though.
The overflow was trivial to fix.
Mmm... how?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 15, 2012 at 4:13 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 15 November 2012 19:54, Greg Stark <stark@mit.edu> wrote:
The only concern I had was about the behaviour after it did the
special case. I didn't want it to keep doing the math and trying to
grow again a little bit every tuple. I think I was leaning to putting
the magic flag back.
The alternative might just be to add a new constant to the
TupSortStatus enum. That might be more logical.
That seems like a misfit to me, but throwing in "bool
cangrowmemtuples" or something like that seems like a good solution.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 15, 2012 at 11:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
So what's next here? Do you want to work on these issue some more?
Or does Jeff?
This has been rewritten enough that I no longer feel much ownership of it.
I'd prefer to leave it to Peter or Greg S., if they are willing to do it.
Cheers,
Jeff
Hi,
On 2012-11-21 14:52:18 -0800, Jeff Janes wrote:
On Thu, Nov 15, 2012 at 11:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
So what's next here? Do you want to work on these issue some more?
Or does Jeff?
This has been rewritten enough that I no longer feel much ownership of it.
I'd prefer to leave it to Peter or Greg S., if they are willing to do it.
Is anybody planning to work on this? There hasn't been any activity
since the beginning of the CF and it doesn't look like there is much
work left?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 December 2012 14:41, Andres Freund <andres@2ndquadrant.com> wrote:
Is anybody planning to work on this? There hasn't been any activity
since the beginning of the CF and it doesn't look like there is much
work left?
I took another look at this.
The growmemtuples bool from Jeff's original patch has been re-added.
My strategy for preventing overflow is to use a uint64, and to use
Min()/Max() as appropriate. As Robert mentioned, even a 64-bit integer
could overflow here, and I account for that. Actually, right now this
is only a theoretical problem on 64-bit platforms, because of the
MaxAllocSize limitation - allowedMem being more than 2^38 (bytes, or
256GB) is a situation in which we won't repalloc anyway, because of
this:
/*
* On a 64-bit machine, allowedMem could be high enough to get us into
* trouble with MaxAllocSize, too.
*/
! if ((Size) (newmemtupsize) >= MaxAllocSize / sizeof(SortTuple))
! goto noalloc;
I reintroduced this check, absent in prior revisions, positioned
around the new code:
! /* We assume here that the memory chunk overhead associated with the
! * memtuples array is constant and so there will be no unexpected addition
! * to what we ask for. (The minimum array size established in
! * tuplesort_begin_common is large enough to force palloc to treat it as a
! * separate chunk, so this assumption should be good. But let's check it,
! * since the above fall-back may be used.)
*/
if (state->availMem <= (long) (state->memtupsize * sizeof(SortTuple)))
return false;
Though we use a uint64 in the computation here, we still don't fully trust
the final value:
! newmemtupsize = Min(Max(memtupsize * allowedMem / memNowUsed,
! memtupsize),
! memtupsize * 2);
I also added a brief note within tuplestore.c to the effect that the
two buffer sizing strategies are not in sync.
Thoughts?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
sortmem_grow-v5.patch (application/octet-stream)
diff src/backend/utils/sort/tuplesort.c
index ce27e40..04a4cbf
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 276,281 ****
--- 276,282 ----
SortTuple *memtuples; /* array of SortTuple structs */
int memtupcount; /* number of tuples currently present */
int memtupsize; /* allocated length of memtuples array */
+ bool growmemtuples; /* memtuples' growth still underway */
/*
* While building initial runs, this is the current output run number
*************** tuplesort_begin_common(int workMem, bool
*** 570,575 ****
--- 571,577 ----
state->memtupcount = 0;
state->memtupsize = 1024; /* initial guess */
+ state->growmemtuples = true;
state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));
*************** tuplesort_end(Tuplesortstate *state)
*** 957,991 ****
* Grow the memtuples[] array, if possible within our memory constraint.
* Return TRUE if able to enlarge the array, FALSE if not.
*
! * At each increment we double the size of the array. When we are short
! * on memory we could consider smaller increases, but because availMem
! * moves around with tuple addition/removal, this might result in thrashing.
! * Small increases in the array size are likely to be pretty inefficient.
*/
static bool
grow_memtuples(Tuplesortstate *state)
{
/*
! * We need to be sure that we do not cause LACKMEM to become true, else
! * the space management algorithm will go nuts. We assume here that the
! * memory chunk overhead associated with the memtuples array is constant
! * and so there will be no unexpected addition to what we ask for. (The
! * minimum array size established in tuplesort_begin_common is large
! * enough to force palloc to treat it as a separate chunk, so this
! * assumption should be good. But let's check it.)
*/
if (state->availMem <= (long) (state->memtupsize * sizeof(SortTuple)))
! return false;
/*
* On a 64-bit machine, allowedMem could be high enough to get us into
* trouble with MaxAllocSize, too.
*/
! if ((Size) (state->memtupsize * 2) >= MaxAllocSize / sizeof(SortTuple))
! return false;
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
! state->memtupsize *= 2;
state->memtuples = (SortTuple *)
repalloc(state->memtuples,
state->memtupsize * sizeof(SortTuple));
--- 959,1050 ----
* Grow the memtuples[] array, if possible within our memory constraint.
* Return TRUE if able to enlarge the array, FALSE if not.
*
! * At each increment we double the size of the array. When we are short on
! * memory we do attempt one last, smaller increase. This happens at most
! * once, since availMem moves around with tuple addition/removal; to do otherwise
! * might result in thrashing. This is nothing more than a last-ditch effort to
! * avoid exceeding allowedMem, an undesirable outcome if avoidable.
*/
static bool
grow_memtuples(Tuplesortstate *state)
{
+ int newmemtupsize;
+ int memtupsize = state->memtupsize;
+ long memNowUsed = state->allowedMem - state->availMem;
+
+ /* This may not be our first time through */
+ if (!state->growmemtuples)
+ return false;
+
/*
! * We need to be sure that we do not cause LACKMEM to become true, else the
! * space management algorithm will go nuts.
! */
! if (memNowUsed <= state->availMem)
! {
! newmemtupsize = memtupsize * 2;
! }
! else
! {
! uint64 allowedMem = state->allowedMem;
!
! /*
! * For this last increment, abandon doubling strategy.
! *
! * To make sure LACKMEM doesn't become true, we can't increase
! * memtupsize by more than state->availMem / sizeof(SortTuple)
! * elements. In practice, we want to increase it by considerably less,
! * because we need to leave some space for the tuples to which the new
! * array slots will refer. We assume the new tuples will be about the
! * same size as the tuples we've already seen, and thus use the known
! * size (in bytes) of the tuples seen so far to estimate an appropriate
! * new size for the memtuples array. The optimal value might be higher
! * or lower than we estimate, but it's hard to know that in advance.
! *
! * In any case, we're definitely safe against enlarging the array so
! * much that LACKMEM becomes true, because the memory currently used
! * includes the present array; thus, there would be enough allowedMem
! * for the new array elements even if no other memory were currently
! * used.
! *
! * allowedMem has been converted to uint64 to prevent overflow on
! * platforms where long is only 32 bits wide. Even still, the first
! * argument to Max() below could overflow.
! *
! * XXX: This approach could prevent us from allocating a very large
! * amount of memory that is still within allowedMem in some cases.
! * However, the MaxAllocSize limitation would prevent such an
! * allocation in that situation.
! */
! state->growmemtuples = false;
! newmemtupsize = Min(Max(memtupsize * allowedMem / memNowUsed,
! memtupsize),
! memtupsize * 2);
! }
!
! /*
! * We assume here that the memory chunk overhead associated with the
! * memtuples array is constant and so there will be no unexpected addition
! * to what we ask for. (The minimum array size established in
! * tuplesort_begin_common is large enough to force palloc to treat it as a
! * separate chunk, so this assumption should be good. But let's check it.)
*/
if (state->availMem <= (long) (state->memtupsize * sizeof(SortTuple)))
! goto noalloc;
/*
* On a 64-bit machine, allowedMem could be high enough to get us into
* trouble with MaxAllocSize, too.
*/
! if ((Size) (newmemtupsize) >= MaxAllocSize / sizeof(SortTuple))
! goto noalloc;
!
! /* Avoid redundant repalloc */
! if (newmemtupsize <= memtupsize)
! goto noalloc;
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
! state->memtupsize = newmemtupsize;
state->memtuples = (SortTuple *)
repalloc(state->memtuples,
state->memtupsize * sizeof(SortTuple));
*************** grow_memtuples(Tuplesortstate *state)
*** 993,998 ****
--- 1052,1061 ----
if (LACKMEM(state))
elog(ERROR, "unexpected out-of-memory situation during sort");
return true;
+
+ noalloc:
+ state->growmemtuples = false;
+ return false;
}
/*
diff src/backend/utils/sort/tuplestore.c
index 1b1cf35..743e578
*** a/src/backend/utils/sort/tuplestore.c
--- b/src/backend/utils/sort/tuplestore.c
*************** tuplestore_puttuple_common(Tuplestoresta
*** 632,639 ****
if (state->memtupcount >= state->memtupsize - 1)
{
/*
! * See grow_memtuples() in tuplesort.c for the rationale
! * behind these two tests.
*/
if (state->availMem > (long) (state->memtupsize * sizeof(void *)) &&
(Size) (state->memtupsize * 2) < MaxAllocSize / sizeof(void *))
--- 632,640 ----
if (state->memtupcount >= state->memtupsize - 1)
{
/*
! * See grow_memtuples() in tuplesort.c for the rationale behind
! * these two tests. Note that that module uses a slightly more
! * sophisticated strategy for sizing its memtuples array.
*/
if (state->availMem > (long) (state->memtupsize * sizeof(void *)) &&
(Size) (state->memtupsize * 2) < MaxAllocSize / sizeof(void *))
Peter Geoghegan <peter@2ndquadrant.com> writes:
> I took another look at this.
> Since Greg S. seems to have lost interest in committing this, I am
> picking it up.
> My strategy for preventing overflow is to use a uint64, and to use
> Min()/Max() as appropriate. As Robert mentioned, even a 64-bit integer
> could overflow here, and I account for that.
It seems to me that there's a much more robust way to do this, namely
use float8 arithmetic. Anybody have a problem with this version of
the last-increment branch?
/*
* This will be the last increment of memtupsize. Abandon doubling
* strategy and instead increase as much as we safely can.
*
* To stay within allowedMem, we can't increase memtupsize by more
* than availMem / sizeof(SortTuple) elements. In practice, we want
* to increase it by considerably less, because we need to leave some
* space for the tuples to which the new array slots will refer. We
* assume the new tuples will be about the same size as the tuples
* we've already seen, and thus we can extrapolate from the space
* consumption so far to estimate an appropriate new size for the
* memtuples array. The optimal value might be higher or lower than
* this estimate, but it's hard to know that in advance.
*
* This calculation is definitely safe against enlarging the array so
* much that LACKMEM becomes true, because the memory currently used
* includes the present array; thus, there would be enough allowedMem
* for the new array elements even if no other memory were currently
* used.
*
* We do the arithmetic in float8, because otherwise the product of
* memtupsize and allowedMem would be quite likely to overflow. Any
* inaccuracy in the result should be insignificant, but just for
* paranoia's sake, we bound the result to be 1 to 2 times the
* existing value. (A little algebra shows that grow_ratio must be
* less than 2 here, so except for roundoff error the Min/Max bounds
* should never do anything.)
*
* Note: it might seem that we need to worry about memtupsize * 2
* overflowing an int, but the MaxAllocSize bound applied below will
* ensure that can't happen.
*/
double grow_ratio;
grow_ratio = (double) state->allowedMem / (double) memNowUsed;
newmemtupsize = (int) (memtupsize * grow_ratio);
newmemtupsize = Max(newmemtupsize, memtupsize);
newmemtupsize = Min(newmemtupsize, memtupsize * 2);
/* We won't make any further enlargement attempts */
state->growmemtuples = false;
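(Spelling out the algebra the comment refers to, assuming this branch is
still entered only when memNowUsed > availMem, as in the v5 patch: since
availMem = allowedMem - memNowUsed, that condition gives
memNowUsed > allowedMem - memNowUsed, hence 2 * memNowUsed > allowedMem,
and therefore grow_ratio = allowedMem / memNowUsed < 2. So, apart from
roundoff, the Min() clamp is indeed a no-op.)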
> I also added a brief note within tuplestore.c to the effect that the
> two buffer sizing strategies are not in sync.
My inclination is to just copy the whole grow_memtuples function into
tuplestore.c too. There's no very good reason why tuplestore should
be stupider than tuplesort about this.
regards, tom lane
Peter Geoghegan <peter@2ndquadrant.com> writes:
> On 8 December 2012 14:41, Andres Freund <andres@2ndquadrant.com> wrote:
>> Is anybody planning to work on this? There hasn't been any activity
>> since the beginning of the CF and it doesn't look like there is much
>> work left?
> I took another look at this.
Applied with some changes:
* Use float8 arithmetic to prevent intermediate-result overflow, as I
suggested last night.
* Rearrange the subsequent checks so that they provide bulletproof
clamping behavior on out-of-range newmemtupsize values; this allows
dropping the very ad-hoc range limiting used in the previous patch (and
in what I had last night).
* Fix the check against availMem; it was supposed to be testing that the
increment in the array size was within availMem. (In the original
coding the increment was always the same as the original size, but not
so much anymore.) A rough sketch of what I mean follows this list.
* Allow the array size to grow to the MaxAllocSize limit, instead of
punting if we'd exceed that. If we're attempting to use as much of
work_mem as we possibly can, I don't see why that should be a reject
case.
* Improve the comments, including the existing one about the availMem
check, since that evidently wasn't clear enough.
* Copy the whole mess into tuplestore.c too.
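To make the rearranged checks concrete, here is roughly the shape being
described, written as a fragment of grow_memtuples using the identifiers
from the v5 patch above. This is only an illustrative sketch, not the
committed code; exact casts and ordering may differ:

	/* Clamp the request to MaxAllocSize rather than refusing to grow at all */
	if ((Size) newmemtupsize >= MaxAllocSize / sizeof(SortTuple))
	{
		newmemtupsize = (int) (MaxAllocSize / sizeof(SortTuple));
		state->growmemtuples = false;	/* can't grow any further after this */
	}

	/* If the clamp left us no larger than before, give up */
	if (newmemtupsize <= memtupsize)
		goto noalloc;

	/*
	 * Check the increment, not the new total, against availMem: the
	 * increment is no longer necessarily equal to the old array size.
	 */
	if (state->availMem <
		(long) ((newmemtupsize - memtupsize) * sizeof(SortTuple)))
		goto noalloc;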
regards, tom lane
On 17 January 2013 18:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Applied with some changes:
Thank you. That feedback is useful.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services