Use generation context to speed up tuplesorts
Hackers,
While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
sorts for single column sorts), I noticed that we use a separate
memory context to store tuple data: if the sort fits in memory we
just reset the context when we're done, and if it does not, we dump
the tuples to disk in the same order they were added and then reset
the context. There is a little pfree() work going on via
writetup_heap() which I think could possibly just be removed to get
some additional gains.
Anyway, this context that stores tuples uses the standard aset.c
allocator which has the usual power of 2 wastage and additional
overheads of freelists etc. I wondered how much faster it would go if
I set it to use a generation context instead of an aset.c one.
After running make installcheck to create the tenk1 table and then
running the attached tuplesortbench script, I get this:
Master:
work_mem = '4MB';
Sort Method: external merge Disk: 2496kB
work_mem = '4GB';
Sort Method: quicksort Memory: 5541kB
Patched:
work_mem = '4MB';
Sort Method: quicksort Memory: 3197kB
So it seems to save quite a bit of memory getting away from rounding
up allocations to the next power of 2.
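For illustration, here is a minimal stand-in for aset.c's power-of-2 chunk rounding (not the actual aset.c code, which also has per-chunk headers and freelists): a ~300-byte tuple ends up costing a 512-byte chunk, which is roughly in line with the memory savings shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for aset.c's chunk rounding: requests are rounded up to the
 * next power of 2. The minimum chunk size of 8 bytes is an assumption
 * for this sketch. */
static size_t
pow2_round(size_t request)
{
    size_t size = 8;

    while (size < request)
        size <<= 1;
    return size;
}
```

So a 272-byte allocation occupies 512 bytes, wasting almost half of each chunk; a generation context hands out (roughly) exactly-sized chunks instead.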
Performance-wise, there's some pretty good gains. (Results in TPS)
work_mem = '4GB';
Test master gen sort compare
Test1 317.2 665.6 210%
Test2 228.6 388.9 170%
Test3 207.4 330.7 159%
Test4 185.5 279.4 151%
Test5 292.2 563.9 193%
If I drop work_mem down to the standard 4MB, the unpatched version
goes to disk but the patched version does not, and the gains get a
little bigger.
work_mem = '4MB';
Test master gen sort compare
Test1 177.5 658.2 371%
Test2 149.7 385.2 257%
Test3 137.5 330.0 240%
Test4 129.0 275.1 213%
Test5 161.7 546.4 338%
The patch is just a simple 1-liner at the moment. I likely do need to
adjust what I'm passing as the blockSize to GenerationContextCreate().
Maybe a better number would be something calculated from work_mem,
e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64), so that we
just allocate at most a 16th of work_mem per chunk, but not bigger
than 8MB. I don't think changing this will affect the performance of
the above very much.
David
On Fri, Jul 30, 2021 at 2:42 AM David Rowley <dgrowleyml@gmail.com> wrote:
Master:
Sort Method: quicksort Memory: 5541kB
Patched:
Sort Method: quicksort Memory: 3197kB
Whoa.
work_mem = '4GB';
Test master gen sort compare
Test1 317.2 665.6 210%
Test2 228.6 388.9 170%
Test3 207.4 330.7 159%
Test4 185.5 279.4 151%
Test5 292.2 563.9 193%
Very impressive.
An early version of what eventually became DSA worked with
backend-local memory and I saw very substantial memory usage
improvements on large sorts, similar to what you show here. I am not
sure I saw the same CPU improvements, and in any case I abandoned the
idea of using that infrastructure to manage backend-local memory at
some point, since the whole thing had lots of problems that I didn't
know how to solve. What you've done here looks like a much more
promising approach.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2021-07-30 18:42:18 +1200, David Rowley wrote:
While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
sorts for single column sorts), I noticed that we use a separate
memory context to store tuple data and we just reset when we're done
if the sort fits in memory, or we dump the tuples to disk in the same
order they're added and reset the context when it does not. There is
a little pfree() work going on via writetup_heap() which I think
possibly could just be removed to get some additional gains.
Anyway, this context that stores tuples uses the standard aset.c
allocator which has the usual power of 2 wastage and additional
overheads of freelists etc. I wondered how much faster it would go if
I set it to use a generation context instead of an aset.c one.
After running make installcheck to make the tenk1 table, running the
attached tuplesortbench script, I get this:
So it seems to save quite a bit of memory getting away from rounding
up allocations to the next power of 2.
Performance-wise, there's some pretty good gains. (Results in TPS)
Very nice!
I wonder if there's cases where generation.c would regress performance
over aset.c due to not having an initial / "keeper" block?
The patch is just a simple 1-liner at the moment. I likely do need to
adjust what I'm passing as the blockSize to GenerationContextCreate().
Maybe a better number would be something that's calculated from
work_mem, e.g Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64))
so that we just allocate at most a 16th of work_mem per chunk, but not
bigger than 8MB. I don't think changing this will affect the
performance of the above very much.
I think it's bad that both generation and slab don't have internal
handling of block sizes. Needing to err on the side of too-big blocks
to handle large amounts of memory well, just so the contexts don't
need to deal with variably sized blocks, isn't a sensible tradeoff.
I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for
tuplesort.c. There are plenty of cases where we'll just sort a handful
of tuples, and increasing the memory usage of those by a factor of
1024 isn't good. The Min() won't do any good if a larger work_mem is
used.
Nor will it be good to use thousands of small allocations for a large
in-memory tuplesort just because we're concerned about the initial
allocation size. Both because of the allocation overhead, but
importantly also because that will make context resets more expensive.
To me this says that we should transplant aset.c's block size growing
into generation.c.
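A rough sketch of what that transplant could look like; the names here are illustrative, not actual generation.c fields, and the real aset.c grows from initBlockSize up to maxBlockSize as blocks are added:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative aset.c-style geometric growth: each newly malloc'd block
 * is twice the size of the previous one, clamped to max_block_size. */
static size_t
next_block_size(size_t prev_block_size, size_t max_block_size)
{
    size_t next = prev_block_size * 2;

    return (next > max_block_size) ? max_block_size : next;
}
```

Starting at 8kB and capping at 1MB, a context reaches the maximum after seven blocks, so small sorts stay small while large sorts avoid thousands of tiny blocks.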
There is at least one path using tuplecontext that reaches code that
could end up freeing memory to a significant enough degree to care about
generation.c effectively not using that memory:
tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
On a quick look I didn't find any expanded record user that frees
nontrivial amounts of memory, but I didn't look all that carefully.
Greetings,
Andres Freund
On 7/30/21 10:38 PM, Andres Freund wrote:
Hi,
On 2021-07-30 18:42:18 +1200, David Rowley wrote:
While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
sorts for single column sorts), I noticed that we use a separate
memory context to store tuple data and we just reset when we're done
if the sort fits in memory, or we dump the tuples to disk in the same
order they're added and reset the context when it does not. There is
a little pfree() work going on via writetup_heap() which I think
possibly could just be removed to get some additional gains.
Anyway, this context that stores tuples uses the standard aset.c
allocator which has the usual power of 2 wastage and additional
overheads of freelists etc. I wondered how much faster it would go if
I set it to use a generation context instead of an aset.c one.
After running make installcheck to make the tenk1 table, running the
attached tuplesortbench script, I get this:
So it seems to save quite a bit of memory getting away from rounding
up allocations to the next power of 2.
Performance-wise, there's some pretty good gains. (Results in TPS)
Very nice!
Yes, very nice. I wouldn't have expected such a significant
difference, particularly in CPU usage. It's pretty interesting that it
reduces both memory and CPU usage; I'd have guessed it would be one or
the other.
I wonder if there's cases where generation.c would regress performance
over aset.c due to not having an initial / "keeper" block?
Not sure. I guess such a workload would need to allocate and free a
single block (so very little memory) very often. I guess that's
possible, but I'm not aware of a place doing that very often.
Although, maybe decoding could do that for a simple (serial) workload.
I'm not opposed to adding a keeper block to Generation, similarly to
what was discussed for Slab not too long ago.
The patch is just a simple 1-liner at the moment. I likely do need to
adjust what I'm passing as the blockSize to GenerationContextCreate().
Maybe a better number would be something that's calculated from
work_mem, e.g Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64))
so that we just allocate at most a 16th of work_mem per chunk, but not
bigger than 8MB. I don't think changing this will affect the
performance of the above very much.
I think it's bad that both generation and slab don't have internal
handling of block sizes. Needing to err on the side of too-big blocks
to handle large amounts of memory well, just so the contexts don't
need to deal with variably sized blocks, isn't a sensible tradeoff.
Well, back then it seemed like a sensible trade off to me, but I agree
it may have negative consequences. I'm not opposed to revisiting this.
I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for
tuplesort.c. There's plenty cases where we'll just sort a handful of
tuples, and increasing the memory usage of those by a factor of 1024
isn't good. The Min() won't do any good if a larger work_mem is used.
Nor will it be good to use thousands of small allocations for a large
in-memory tuplesort just because we're concerned about the initial
allocation size. Both because of the allocation overhead, but
importantly also because that will make context resets more expensive.
True.
To me this says that we should transplant aset.c's block size growing
into generation.c.
Yeah, maybe.
There is at least one path using tuplecontext that reaches code that
could end up freeing memory to a significant enough degree to care about
generation.c effectively not using that memory:
tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
On a quick look I didn't find any expanded record user that frees
nontrivial amounts of memory, but I didn't look all that carefully.
Not sure, I'm not familiar with EOH_flatten_into or expanded records.
But I wonder if there's some sort of metric that we could track in
Generation and use it to identify "interesting" places.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
I spent a bit of time hacking on the Generation context, adding the two
improvements discussed in this thread:
1) internal handling of block sizes, similar to what AllocSet does (it
pretty much just copies parts of it)
2) keeper block (we keep one empty block instead of freeing it)
3) I've also added allocChunkLimit, which makes it look a bit more like
AllocSet (instead of using just blockSize/8, which does not work too
well with dynamic blockSize)
I haven't done any extensive tests on it, but it does pass check-world
with asserts etc. I haven't touched the comments, those need updating.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Generation-grow-blocks.patch
0002-Generation-keeper-block.patch
0003-Generation-allocChunkLimit.patch
On Sat, 31 Jul 2021 at 14:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
I spent a bit of time hacking on the Generation context, adding the two
improvements discussed in this thread:
1) internal handling of block sizes, similar to what AllocSet does (it
pretty much just copies parts of it)
2) keeper block (we keep one empty block instead of freeing it)
3) I've also added allocChunkLimit, which makes it look a bit more like
AllocSet (instead of using just blockSize/8, which does not work too
well with dynamic blockSize)
I haven't done any extensive tests on it, but it does pass check-world
with asserts etc. I haven't touched the comments, those need updating.
regards
Thanks for starting work on that. I've only had a quick look, but I
can have a more detailed look once you've got it more complete.
For now it does not really look like the keeper block stuff is wired
up the same way as in aset.c. I'd expect you to be allocating that in
the same malloc as you're using to allocate the context struct itself
in GenerationContextCreate().
Also, likely as a result of the above, minContextSize does not seem to
be wired up to anything apart from an Assert().
David
On Sat, 31 Jul 2021 at 08:38, Andres Freund <andres@anarazel.de> wrote:
There is at least one path using tuplecontext that reaches code that
could end up freeing memory to a significant enough degree to care about
generation.c effectively not using that memory:
tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
On a quick look I didn't find any expanded record user that frees
nontrivial amounts of memory, but I didn't look all that carefully.
I guess we could just use a normal context for datum sorts if we
thought that might be a problem.
I'm not too familiar with the expanded object code, but I'm struggling
to imagine why anything would need to do a pfree in there. We just do
EOH_get_flat_size() to determine how big to make the allocation then
allocate some memory for EOH_flatten_into() to use to expand the
object into.
David
On 8/2/21 1:17 PM, David Rowley wrote:
On Sat, 31 Jul 2021 at 14:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
I spent a bit of time hacking on the Generation context, adding the two
improvements discussed in this thread:
1) internal handling of block sizes, similar to what AllocSet does (it
pretty much just copies parts of it)
2) keeper block (we keep one empty block instead of freeing it)
3) I've also added allocChunkLimit, which makes it look a bit more like
AllocSet (instead of using just blockSize/8, which does not work too
well with dynamic blockSize)
I haven't done any extensive tests on it, but it does pass check-world
with asserts etc. I haven't touched the comments, those need updating.
regards
Thanks for starting work on that. I've only had a quick look, but I
can have a more detailed look once you've got it more complete.
A review would be nice, although it can wait. It'd be interesting to
know if those patches help with the workload(s) you've been looking at.
For now it does not really look like the keeper block stuff is wired
up the same way as in aset.c. I'd expect you to be allocating that in
the same malloc as you're using to allocate the context struct itself
in GenerationContextCreate().
Yes, that difference is natural. AllocSet works a bit differently, as
it does not release the blocks (except during reset), while the
Generation context frees the blocks. So it seems pointless to use the
same "keeper" block as AllocSet - instead my intention was to keep one
"allocated" block as a cache, which should help with tight
pfree/palloc cycles. Maybe we should not call it a "keeper" block?
Also, likely as a result of the above, minContextSize does not seem to
be wired up to anything apart from an Assert().
Hmm, yeah. This is probably due to copying some of the block-growth and
keeper block code from AllocSet. There should be just init/max block
size, I think.
I did run the same set of benchmarks as for Slab, measuring some usual
allocation patterns. The results for i5-2500k machine are attached (for
the xeon it's almost exactly the same behavior). While running those
tests I realized the last patch is wrong and sets allocChunkLimit=1,
which is bogus and causes significant regression. So here's an updated
version of the patch series too.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-generation-bench-v2.patch
0002-Generation-grow-blocks-v2.patch
0003-Generation-keeper-block-v2.patch
0004-Generation-allocChunkLimit-v2.patch
generation i5.ods
Hi,
On 2021-08-03 10:59:25 +1200, David Rowley wrote:
On Sat, 31 Jul 2021 at 08:38, Andres Freund <andres@anarazel.de> wrote:
There is at least one path using tuplecontext that reaches code that
could end up freeing memory to a significant enough degree to care about
generation.c effectively not using that memory:
tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
On a quick look I didn't find any expanded record user that frees
nontrivial amounts of memory, but I didn't look all that carefully.
I guess we could just use a normal context for datum sorts if we
thought that might be a problem.
I think that's probably a cure worse than the disease. I suspect datum sorts
can benefit from the higher density quite a bit...
I'm not too familiar with the expanded object code, but I'm struggling
to imagine why anything would need to do a pfree in there. We just do
EOH_get_flat_size() to determine how big to make the allocation then
allocate some memory for EOH_flatten_into() to use to expand the
object into.
I can see some scenarios with a bit more creative uses of expanded
objects. We've e.g. been talking about using EA to avoid repeated and partial
detoasting overhead and you might need to do some more toast fetches when
flattening. Toast fetches always allocate, and if the fetch were only for
later parts of the tuple, the fetched data would need to be freed.
It's probably fine to deal with this at later time, and just leave a comment
somewhere.
It could be addressed by having a bump style allocator combined with having
freelists. It's not like the tuplesort.c case is actually interested in the
generational behaviour of generation.c (which makes freelists uninteresting),
it's just that generation.c is the densest allocator that we have right now...
Greetings,
Andres Freund
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
A review would be nice, although it can wait - It'd be interesting to
know if those patches help with the workload(s) you've been looking at.
I tried out the v2 set of patches using the attached scripts. The
attached spreadsheet includes the original tests and compares master
with the patch which uses the generation context vs that patch plus
your v2 patch.
I've also included 4 additional tests, each of which starts with a 1
column table and then adds another 32 columns testing the performance
after adding each additional column. I did this because I wanted to
see if the performance was more similar to master when the allocations
had less power of 2 wastage from allocset. If, for example, you look
at row 123 of the spreadsheet you can see both patched and unpatched
the allocations were 272 bytes each yet there was still a 50%
performance improvement with just the generation context patch when
compared to master.
Looking at the spreadsheet, you'll also notice that in the 2 column
test of each of the 4 new tests, the number of bytes used for each
allocation is larger with the generation context: 56 vs 48. This is
due to the GenerationChunk struct being larger than the AllocSet
version by 8 bytes, because it also holds a pointer to the
GenerationBlock. So with the patch there are some cases where we'll
use slightly more memory.
Additional tests:
1. Sort 10000 tuples on a column with values 0-99 in memory.
2. As #1 but with 1 million tuples.
3. As #1 but with a large OFFSET to remove the overhead of sending to the client.
4. As #2 but with a large OFFSET.
Test #3 above is the most similar one to the original tests and shows
similar gains. When the sort becomes larger (1 million tuple test),
the gains reduce. This indicates the gains are coming from improved
CPU cache efficiency from the removal of the power of 2 wastage in
memory allocations.
All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth seeing what happens if,
instead of doubling the allocations each time, we quadruple them. I
didn't try this.
David
Attachments:
sortbench_1m.sh.txt
generation context tuplesort.ods
sortbench_10k.sh.txt
On 8/6/21 3:07 PM, David Rowley wrote:
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
A review would be nice, although it can wait - It'd be interesting to
know if those patches help with the workload(s) you've been looking at.
I tried out the v2 set of patches using the attached scripts. The
attached spreadsheet includes the original tests and compares master
with the patch which uses the generation context vs that patch plus
your v2 patch.
I've also included 4 additional tests, each of which starts with a 1
column table and then adds another 32 columns testing the performance
after adding each additional column. I did this because I wanted to
see if the performance was more similar to master when the allocations
had less power of 2 wastage from allocset. If, for example, you look
at row 123 of the spreadsheet you can see both patched and unpatched
the allocations were 272 bytes each yet there was still a 50%
performance improvement with just the generation context patch when
compared to master.
Looking at the spreadsheet, you'll also notice that in the 2 column
test of each of the 4 new tests the number of bytes used for each
allocation is larger with the generation context. 56 vs 48. This is
due to the GenerationChunk struct size being larger than the Allocset's
version by 8 bytes. This is because it also holds the
GenerationBlock. So with the patch there are some cases where we'll
use slightly more memory.
Additional tests:
1. Sort 10000 tuples on a column with values 0-99 in memory.
2. As #1 but with 1 million tuples.
3. As #1 but with a large OFFSET to remove the overhead of sending to the client.
4. As #2 but with a large OFFSET.
Test #3 above is the most similar one to the original tests and shows
similar gains. When the sort becomes larger (1 million tuple test),
the gains reduce. This indicates the gains are coming from improved
CPU cache efficiency from the removal of the power of 2 wastage in
memory allocations.
All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth trying to see what happens
if instead of doubling the allocations each time, quadruple them
instead. I didn't try this.
Thanks for the scripts and the spreadsheet with results.
I doubt quadrupling the allocations will help very much, but I suspect
the problem might be in the 0004 patch - at least that's what shows a
regression in my results. Could you try with just 0001-0003 applied?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 8/6/21 3:07 PM, David Rowley wrote:
All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth trying to see what happens
if instead of doubling the allocations each time, quadruple them
instead. I didn't try this.
I doubt quadrupling the allocations will help very much, but I suspect
the problem might be in the 0004 patch - at least that's what shows
regression in my results. Could you try with just 0001-0003 applied?
But 0004 only changes the logic which controls the threshold of when
we allocate an oversized chunk. It looks like the threshold is 512KB
with the 0004 patch. My test is only doing a maximum allocation of
296 bytes so will never allocate an oversized chunk.
Can you explain why you think 0004 would cause performance regressions?
David
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth trying to see what happens
if instead of doubling the allocations each time, quadruple them
instead. I didn't try this.
I doubt quadrupling the allocations will help very much, but I suspect
the problem might be in the 0004 patch - at least that's what shows
regression in my results. Could you try with just 0001-0003 applied?
I tried quadrupling the buffer instead of doubling it each time and
got the attached. Column E (green in the graphs) shows the results of
that. It's now much closer to the original patch which just made the
block size 8MB.
David
Attachments:
generation context tuplesort_v2.ods
On 8/8/21 9:02 AM, David Rowley wrote:
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 8/6/21 3:07 PM, David Rowley wrote:
All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth trying to see what happens
if instead of doubling the allocations each time, quadruple them
instead. I didn't try this.
I doubt quadrupling the allocations will help very much, but I suspect
the problem might be in the 0004 patch - at least that's what shows
regression in my results. Could you try with just 0001-0003 applied?
But 0004 only changes the logic which controls the threshold of when
we allocate an oversized chunk. It looks like the threshold is 512KB
with the 0004 patch. My test is only doing a maximum allocation of
296 bytes so will never allocate an oversized chunk.
Can you explain why you think 0004 would cause performance regressions?
It's based solely on the results of my benchmarks, where this patch
seems to cause a performance regression. I agree it's a bit bizarre,
considering what the patch does.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/8/21 12:11 PM, David Rowley wrote:
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth trying to see what happens
if instead of doubling the allocations each time, quadruple them
instead. I didn't try this.
I doubt quadrupling the allocations will help very much, but I suspect
the problem might be in the 0004 patch - at least that's what shows
regression in my results. Could you try with just 0001-0003 applied?
I tried the quadrupling of the buffer instead of doubling it each time
and got the attached. Column E, or green in the graphs show the
results of that. It's now much closer to the original patch which just
made the block size 8MB.
Interesting, I wouldn't have expected that to make such a difference.
I'm not sure quadrupling the size is a good idea, though, because it
increases the amount of memory we might be wasting. With doubling, the
amount of wasted/unused memory is limited to ~50%, because the next
block is (roughly) equal to the sum of the already allocated blocks,
so allocating just 1B on it leaves us with 50%. But quadrupling the
size means we'll end up with ~75% free space. Of course, this is
capped by the maximum block size etc., but still ...
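The ~50% vs ~75% figures can be checked numerically with a hypothetical helper that sums a geometric block sequence (ignoring the block-size cap for simplicity) and reports what fraction the final, nearly empty block represents:

```c
#include <assert.h>

/* Fraction of total allocated memory occupied by the last block in a
 * geometric sequence of nblocks blocks growing by 'growth' each time.
 * If only ~1 byte of the last block is used, this approximates the
 * wasted fraction, which tends toward (growth - 1) / growth. */
static double
last_block_fraction(double init_size, double growth, int nblocks)
{
    double total = 0.0;
    double size = init_size;
    double last = init_size;
    int    i;

    for (i = 0; i < nblocks; i++)
    {
        total += size;
        last = size;
        size *= growth;
    }
    return last / total;
}
```

With growth = 2 the fraction converges to 1/2; with growth = 4 it converges to 3/4, matching the argument above.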
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
I did run the same set of benchmarks as for Slab, measuring some usual
allocation patterns. The results for i5-2500k machine are attached (for
the xeon it's almost exactly the same behavior). While running those
tests I realized the last patch is wrong and sets allocChunkLimit=1,
which is bogus and causes significant regression. So here's an updated
version of the patch series too.
I know you're not done with these yet, but FWIW, I was getting an
Assert failure with these patches on:
Assert(total_allocated == context->mem_allocated);
It seems to be because you've forgotten to ignore keeper blocks when
adjusting context->mem_allocated in GenerationReset().
David
On Mon, 9 Aug 2021 at 00:38, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
I'm not sure quadrupling the size is a good idea, though, because it
increases the amount of memory we might be wasting. With the doubling,
the amount of wasted/unused memory is limited to ~50%, because the next
block is (roughly) equal to sum of already allocated blocks, so
allocating just 1B on it leaves us with 50%. But quadrupling the size
means we'll end up with ~75% free space. Of course, this is capped by
the maximum block size etc. but still ...
Yeah, not sure what is best. It does however seem likely that the
majority of the performance improvement that I saw is due to either
malloc()/free() calls or just having fewer blocks in the context.
Maybe it's worth getting the planner on board with deciding how to do
the allocations. It feels a bit overcautious to go allocating blocks
in each power of two starting at 8192 bytes when doing a 1GB sort.
Maybe we should look at making the initial allocation size something
like pg_prevpower2_64(Min(work_mem * 1024L, sort_tuples *
tuple_width)), or maybe half or a quarter of that.
It would certainly not be the only executor node to allocate memory
based on what the planner thought. Just look at ExecHashTableCreate().
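That sizing idea can be sketched standalone; pg_prevpower2_64 is re-implemented locally here for illustration (largest power of 2 <= num), initial_block_size is a hypothetical name, and the tuple count and width would come from planner estimates:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for pg_prevpower2_64: largest power of 2 <= num,
 * assuming num > 0. */
static uint64_t
prevpower2_64(uint64_t num)
{
    uint64_t result = 1;

    while (result <= num / 2)
        result *= 2;
    return result;
}

/* Hypothetical planner-informed sizing: estimated sort footprint
 * (tuples * width), clamped to work_mem (in kB), rounded down to a
 * power of 2. */
static uint64_t
initial_block_size(uint64_t work_mem_kb, uint64_t sort_tuples,
                   uint64_t tuple_width)
{
    uint64_t est = sort_tuples * tuple_width;
    uint64_t cap = work_mem_kb * 1024;

    return prevpower2_64(est < cap ? est : cap);
}
```

For example, 10000 tuples of ~300 bytes under a 4MB work_mem would start with a single 2MB block instead of walking up from 8kB.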
David
Hi,
I've spent quite a bit of time investigating the regressions clearly
visible in the benchmark results for some allocation/free patterns.
Attached is a subset of results from a recent run, but the old results
show mostly the same thing.
Some allocation patterns (e.g. fifo-cycle or lifo-cycle) are perfectly
fine - the patches are a clear improvement. For example, results for
the fifo-decrease pattern look like this (showing just timings of
palloc):
block chunk 0001 0002 0003 0004
----------------------------------------------------
32768 512 38497 37234 39817 226451
0001 just adds an extension used for the benchmarking, so it's pretty
much "master" without any behavior changes. 0002 and 0003 make some
changes that seem to have no impact on this workload, but 0004 makes it
about 5x slower. Which is bizarre, because 0004 just tweaks how we
calculate the threshold for oversized chunks.
But that shouldn't have any impact, because with the benchmark
parameters the threshold should be 64kB, way more than the chunk size
(which is what we allocate).
Similarly, it should not matter whether we double or quadruple the
block size, because we reach the maximum block size (1MB) very fast in
both cases, and then just use 1MB blocks for 99% of the benchmark.
I was thinking that maybe the extra allocChunkLimit increases the size
of the GenerationContext struct, requiring another cacheline. But no,
it's 3 cachelines in both cases.
But while trying to investigate / profile this, I noticed a rather
bizarre behavior, which I can't explain but I suspect it may be related
to the slowdown.
The benchmark queries are executing generation_bench_fifo() for a range
of block size / chunk_size combinations, and look about like this:
select block_size, chunk_size, x.*
from generate_series(32,512,32) a(chunk_size),
(values (1024), (2048), (4096), (8192), (16384), (32768))
AS b(block_size),
lateral generation_bench_fifo(1000000, block_size, chunk_size,
2*chunk_size, 100, 10000, 5000) x;
Which takes a while to run, and then produces something like this:
block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
1024 | 32 | 81528832 | 62016 | 24601
1024 | 64 | 137449472 | 61951 | 36034
1024 | 96 | 208408464 | 85437 | 57984
...
32768 | 448 | 707756032 | 36605 | 67648
32768 | 480 | 757104640 | 36838 | 71967
32768 | 512 | 806256640 | 37073 | 76489
(96 rows)
Now, I can run the benchmark for a single case (e.g. the last one in the
results above), like this:
select block_size, chunk_size, x.* from
(values (512)) AS a(chunk_size),
(values (32768)) AS b(block_size),
lateral generation_bench_fifo(1000000, block_size, chunk_size,
2*chunk_size, 100, 10000, 5000) x;
And now comes the funny part - if I run it in the same backend as the
"full" benchmark, I get roughly the same results:
block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 806256640 | 37159 | 76669
but if I reconnect and run it in a new backend, I get this:
block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 806158336 | 233909 | 100785
(1 row)
It does not matter if I wait a bit before running the query, if I run it
repeatedly, etc. The machine is not doing anything else, the CPU is set
to use "performance" governor, etc.
This is really strange, and the really suspicious thing is that this
slower timing is very close to the 0004 timing (which is 226451). So
perhaps the reason for these slowdowns is the same, it's just that with
0004 it happens all the time.
According to perf profiling, it seems there's one pretty significant
change - 0004 spends quite a bit of time in asm_exc_page_fault
41.17% 5.26% postgres postgres [.] GenerationAlloc
|
--36.06%--GenerationAlloc
|
|--28.95%--asm_exc_page_fault
| |
| --23.17%--exc_page_fault
| |
while on 0003 there's not a single frame with asm_exc_page_fault. I have
no idea why, though.
I wonder if this might be some unexpected / unfortunate interaction with
kernel memory management, or something like that. Any ideas what might
be the root cause, what to look for, etc.?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
generation.ods (application/vnd.oasis.opendocument.spreadsheet)
0001-generation-bench.patch (text/x-patch)
0002-Generation-grow-blocks.patch (text/x-patch)
0003-Generation-keeper-block.patch (text/x-patch)
0004-Generation-allocChunkLimit.patch (text/x-patch)
On Thursday, 9 September 2021 at 15:37:59 CET, Tomas Vondra wrote:
And now comes the funny part - if I run it in the same backend as the
"full" benchmark, I get roughly the same results:
block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 806256640 | 37159 | 76669
but if I reconnect and run it in a new backend, I get this:
block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 806158336 | 233909 | 100785
(1 row)
It does not matter if I wait a bit before running the query, if I run it
repeatedly, etc. The machine is not doing anything else, the CPU is set
to use "performance" governor, etc.
I've reproduced the behaviour you mention.
I also noticed asm_exc_page_fault showing up in the perf report in that case.
Running an strace on it shows that in one case we get a lot of brk
calls, while when we run in the same process as the previous tests, we
don't.
My suspicion is that the previous workload makes glibc malloc change its
trim_threshold and possibly other dynamic options, which leads to constantly
moving the brk pointer in one case and not the other.
Running your fifo test with absurd malloc options shows that this might
indeed be the case (I needed to change several of them, because changing
just one disables the dynamic adjustment for every single one of them, and
malloc would then fall back to using mmap and freeing the memory on each
iteration):
mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));
I get the following results for your self-contained test. I ran the query
twice in each case, which shows the difference between the first
allocation and the subsequent ones:
With default malloc options:
block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 795836416 | 300156 | 207557
block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 795836416 | 211942 | 77207
With the oversized values above:
block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 795836416 | 219000 | 36223
block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 795836416 | 75761 | 78082
(1 row)
I can't tell how representative your benchmark extension is of real-life
allocation / free patterns, but there is probably something we can improve
here.
I'll try to see if I can understand more precisely what is happening.
--
Ronan Dunklau
On 12/8/21 16:51, Ronan Dunklau wrote:
[...]
I can't tell how representative your benchmark extension is of real-life
allocation / free patterns, but there is probably something we can improve
here.
Thanks for looking at this. I think those allocation / free patterns are
fairly extreme, and there probably are no workloads doing exactly this.
The idea is that actual workloads are likely some combination of these
extreme cases.
I'll try to see if I can understand more precisely what is happening.
Thanks, that'd be helpful. Maybe we can learn something about tuning
malloc parameters to get significantly better performance.
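For experimenting without recompiling anything, glibc also honors the
same knobs via environment variables (note the trailing underscores),
so one possible setup mirroring the mallopt() values upthread would be:

```shell
# glibc reads these once at process start, so the server has to be
# restarted for backends to pick them up.
export MALLOC_TOP_PAD_=$((1024 * 1024 * 1024))
export MALLOC_TRIM_THRESHOLD_=$((256 * 1024 * 1024))
export MALLOC_MMAP_THRESHOLD_=$((32 * 1024 * 1024))
# Then restart the server so every backend inherits them, e.g.:
#   pg_ctl restart
echo "$MALLOC_TRIM_THRESHOLD_"   # prints 268435456
```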
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company