slab allocator performance issues
Hi,
I just tried to use the slab allocator for a case where aset.c was
bloating memory usage substantially. First: It worked wonders for memory
usage, nearly eliminating overhead.
But it turned out to cause a *substantial* slowdown. With aset the
allocator is barely in the profile. With slab the profile is dominated
by allocator performance.
slab:
NOTICE: 00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
LOCATION: bfm_test_insert_bulk, radix.c:2880
Overhead Command Shared Object Symbol
+ 28.27% postgres postgres [.] SlabAlloc
+ 9.64% postgres bdbench.so [.] bfm_delete
+ 9.03% postgres bdbench.so [.] bfm_set
+ 8.39% postgres bdbench.so [.] bfm_lookup
+ 8.36% postgres bdbench.so [.] bfm_set_leaf.constprop.0
+ 8.16% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms
+ 6.88% postgres bdbench.so [.] bfm_delete_leaf
+ 3.24% postgres libc-2.31.so [.] _int_malloc
+ 2.58% postgres bdbench.so [.] bfm_tests
+ 2.33% postgres postgres [.] SlabFree
+ 1.29% postgres libc-2.31.so [.] _int_free
+ 1.09% postgres libc-2.31.so [.] unlink_chunk.constprop.0
aset:
NOTICE: 00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
LOCATION: bfm_test_insert_bulk, radix.c:2880
+ 16.43% postgres bdbench.so [.] bfm_lookup
+ 15.38% postgres bdbench.so [.] bfm_delete
+ 12.82% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms
+ 12.65% postgres bdbench.so [.] bfm_set
+ 12.15% postgres bdbench.so [.] bfm_set_leaf.constprop.0
+ 10.57% postgres bdbench.so [.] bfm_delete_leaf
+ 4.05% postgres bdbench.so [.] bfm_tests
+ 2.93% postgres [kernel.vmlinux] [k] clear_page_erms
+ 1.59% postgres postgres [.] AllocSetAlloc
+ 1.15% postgres bdbench.so [.] memmove@plt
+ 1.06% postgres bdbench.so [.] bfm_grow_leaf_16
OS:
NOTICE: 00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
LOCATION: bfm_test_insert_bulk, radix.c:2880
That is somewhat surprising - part of the promise of a slab allocator is
that it's fast...
This is caused by multiple issues, I think, some of which seem fairly easy to
fix.
1) If allocations are short-lived, slab.c can end up constantly
freeing/initializing blocks. Which requires fairly expensively iterating over
all potential chunks in the block and initializing them. Just to then free that
memory again after a small number of allocations. The extreme case of this is
when there are phases of alloc/free of a single allocation.
I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
only works if the churn is on the last remaining block, so it's not a great
approach. Perhaps just keeping the last allocated block around would work?
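To make the "keep the last block around" idea concrete, here is a minimal,
self-contained sketch. The names (Slab, Block, keeper) and the layout are
hypothetical simplifications, not slab.c's actual structures: on free, the
last block to become empty is cached instead of released; on allocation, the
cached block is reused without re-walking and re-linking its chunks.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified context - not slab.c's real layout. */
typedef struct Block
{
    int     nfree;
    int     nchunks;
} Block;

typedef struct Slab
{
    Block  *keeper;     /* one cached empty block, instead of free()ing it */
    int     nblocks;
} Slab;

/* Called when a block becomes entirely free. */
static void
block_became_empty(Slab *slab, Block *block)
{
    if (slab->keeper == NULL)
        slab->keeper = block;   /* cache it for cheap reuse */
    else
    {
        free(block);            /* already have a spare: release it */
        slab->nblocks--;
    }
}

/* Called when a new block is needed. */
static Block *
get_block(Slab *slab, int nchunks)
{
    Block  *block;

    if (slab->keeper != NULL)
    {
        /* Reuse the cached block: no malloc, and crucially no
         * re-initialization of all its chunks. */
        block = slab->keeper;
        slab->keeper = NULL;
        return block;
    }
    block = malloc(sizeof(Block));
    block->nfree = block->nchunks = nchunks;
    slab->nblocks++;
    return block;
}
```

With this, alloc/free phases of a single allocation bounce between the keeper
and the active block instead of hitting malloc/free and chunk initialization
every time.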
2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] https://www.agner.org/optimize/instruction_tables.pdf, has a
latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
still slow enough there to be very worrisome.
I don't see a way to get around the division while keeping the freelist
structure as is. But:
ISTM that we only need the index because of the free-chunk list, right? Why
don't we make the chunk list use actual pointers? Is it a concern that that'd
increase the minimum allocation size? If so, I see two ways around that:
First, we could make the index just the offset from the start of the block,
that's much cheaper to calculate. Second, we could store the next pointer in
SlabChunk->slab/block instead (making it a union) - while on the freelist we
don't need to dereference those, right?
I suspect both would also make the block initialization a bit cheaper.
That should also accelerate SlabBlockGetChunk(), which currently shows up as
an imul, which isn't exactly fast either (and uses a lot of execution ports).
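A rough sketch of the pointer-based free-chunk list (simplified, hypothetical
types - not slab.c's real SlabChunk): while a chunk is allocated its header
holds the block backlink, and while it is free the same bytes hold the next
free chunk, so neither allocation nor free needs a division or multiplication.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct Block;

/* Union trick: the same header bytes are a backlink while allocated,
 * and a freelist next-pointer while free. No chunk index, no div. */
typedef struct Chunk
{
    union
    {
        struct Block *block;    /* valid while allocated */
        struct Chunk *next;     /* valid while on the freelist */
    }           u;
} Chunk;

typedef struct Block
{
    Chunk      *freehead;       /* head of this block's free-chunk list */
    int         nfree;
} Block;

static void *
chunk_alloc(Block *block)
{
    Chunk      *chunk = block->freehead;

    if (chunk == NULL)
        return NULL;            /* block exhausted */
    block->freehead = chunk->u.next;    /* pop: pointer chase, no div/imul */
    chunk->u.block = block;             /* repurpose bytes as backlink */
    block->nfree--;
    return (char *) chunk + sizeof(Chunk);
}

static void
chunk_free(void *ptr)
{
    Chunk      *chunk = (Chunk *) ((char *) ptr - sizeof(Chunk));
    Block      *block = chunk->u.block; /* backlink: no SlabChunkIndex() */

    chunk->u.next = block->freehead;    /* push back onto freelist */
    block->freehead = chunk;
    block->nfree++;
}

/* Build a toy block with nchunks chunks of chunksize usable bytes each. */
static Block *
block_create(int nchunks, size_t chunksize)
{
    Block      *block = malloc(sizeof(Block));
    char       *mem = malloc(nchunks * (sizeof(Chunk) + chunksize));

    block->freehead = NULL;
    block->nfree = nchunks;
    for (int i = nchunks - 1; i >= 0; i--)
    {
        Chunk      *c = (Chunk *) (mem + i * (sizeof(Chunk) + chunksize));

        c->u.next = block->freehead;
        block->freehead = c;
    }
    return block;
}
```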
3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
shows up prominently in profiles. That's ~tripling the number of cachelines
touched in the happy path, with unpredictable accesses to boot.
Perhaps we could reduce the precision of slab->freelist indexing to amortize
that cost? I.e. not have one slab->freelist entry for each nfree, but instead
have an upper limit on the number of freelists?
4) Less of a performance issue, and more of a usability issue: The constant
block size strikes me as problematic. Most users of an allocator will
sometimes deal with a small amount of data, and sometimes with a
large amount.
Greetings,
Andres Freund
Hi,
On 2021-07-17 12:43:33 -0700, Andres Freund wrote:
> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
> up ~50% of the cycles in SlabFree(). A 64bit div, according to [1], has a
> latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
> i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
> still slow enough there to be very worrisome.
>
> I don't see a way to get around the division while keeping the freelist
> structure as is. But:
>
> ISTM that we only need the index because of the free-chunk list, right? Why
> don't we make the chunk list use actual pointers? Is it a concern that that'd
> increase the minimum allocation size? If so, I see two ways around that:
> First, we could make the index just the offset from the start of the block,
> that's much cheaper to calculate. Second, we could store the next pointer in
> SlabChunk->slab/block instead (making it a union) - while on the freelist we
> don't need to dereference those, right?
>
> I suspect both would also make the block initialization a bit cheaper.
> That should also accelerate SlabBlockGetChunk(), which currently shows up as
> an imul, which isn't exactly fast either (and uses a lot of execution ports).
Oh - I just saw that effectively the allocation size already is a
uintptr_t at minimum. I had only seen
/* Make sure the linked list node fits inside a freed chunk */
if (chunkSize < sizeof(int))
chunkSize = sizeof(int);
but it's followed by
/* chunk, including SLAB header (both addresses nicely aligned) */
fullChunkSize = sizeof(SlabChunk) + MAXALIGN(chunkSize);
which means we are reserving enough space for a pointer on just about
any platform already. Seems we can just make that official and reserve
space for a pointer as part of rounding up the chunk size, instead of in
fullChunkSize?
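A sketch of that rounding rule (MAXALIGN is hardcoded to 8 here purely for
illustration; the real code takes it from the platform): reserving a pointer's
worth of space in chunkSize itself, before MAXALIGN, makes the minimum
official without changing the effective size on common 64-bit platforms.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only - PostgreSQL derives this from the platform ABI. */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(s) (((s) + MAXIMUM_ALIGNOF - 1) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

/* Reserve pointer space as part of the chunk size rounding, instead of
 * the current sizeof(int) minimum: after MAXALIGN, the result is always
 * big enough to hold a freelist next-pointer. */
static size_t
round_chunk_size(size_t chunkSize)
{
    if (chunkSize < sizeof(void *))     /* official minimum: a pointer */
        chunkSize = sizeof(void *);
    return MAXALIGN(chunkSize);
}
```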
Greetings,
Andres Freund
Hi,
On 7/17/21 9:43 PM, Andres Freund wrote:
Hi,
> I just tried to use the slab allocator for a case where aset.c was
> bloating memory usage substantially. First: It worked wonders for memory
> usage, nearly eliminating overhead.
>
> But it turned out to cause a *substantial* slowdown. With aset the
> allocator is barely in the profile. With slab the profile is dominated
> by allocator performance.
>
> slab:
> NOTICE: 00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
> LOCATION: bfm_test_insert_bulk, radix.c:2880
> Overhead Command Shared Object Symbol
> + 28.27% postgres postgres [.] SlabAlloc
> + 9.64% postgres bdbench.so [.] bfm_delete
> + 9.03% postgres bdbench.so [.] bfm_set
> + 8.39% postgres bdbench.so [.] bfm_lookup
> + 8.36% postgres bdbench.so [.] bfm_set_leaf.constprop.0
> + 8.16% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms
> + 6.88% postgres bdbench.so [.] bfm_delete_leaf
> + 3.24% postgres libc-2.31.so [.] _int_malloc
> + 2.58% postgres bdbench.so [.] bfm_tests
> + 2.33% postgres postgres [.] SlabFree
> + 1.29% postgres libc-2.31.so [.] _int_free
> + 1.09% postgres libc-2.31.so [.] unlink_chunk.constprop.0
>
> aset:
> NOTICE: 00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
> LOCATION: bfm_test_insert_bulk, radix.c:2880
> + 16.43% postgres bdbench.so [.] bfm_lookup
> + 15.38% postgres bdbench.so [.] bfm_delete
> + 12.82% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms
> + 12.65% postgres bdbench.so [.] bfm_set
> + 12.15% postgres bdbench.so [.] bfm_set_leaf.constprop.0
> + 10.57% postgres bdbench.so [.] bfm_delete_leaf
> + 4.05% postgres bdbench.so [.] bfm_tests
> + 2.93% postgres [kernel.vmlinux] [k] clear_page_erms
> + 1.59% postgres postgres [.] AllocSetAlloc
> + 1.15% postgres bdbench.so [.] memmove@plt
> + 1.06% postgres bdbench.so [.] bfm_grow_leaf_16
>
> OS:
> NOTICE: 00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
> LOCATION: bfm_test_insert_bulk, radix.c:2880
>
> That is somewhat surprising - part of the promise of a slab allocator is
> that it's fast...
>
> This is caused by multiple issues, I think, some of which seem fairly easy to
> fix.
>
> 1) If allocations are short-lived, slab.c can end up constantly
> freeing/initializing blocks. Which requires fairly expensively iterating over
> all potential chunks in the block and initializing them. Just to then free that
> memory again after a small number of allocations. The extreme case of this is
> when there are phases of alloc/free of a single allocation.
>
> I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
> problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
> only works if the churn is on the last remaining block, so it's not a great
> approach. Perhaps just keeping the last allocated block around would work?
+1
I think it makes perfect sense to not free the blocks immediately, and
keep one (or a small number) as a cache. I'm not sure why we decided not
to have a "keeper" block, but I suspect memory consumption was my main
concern at that point. But I never expected the cost to be this high.
> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
> up ~50% of the cycles in SlabFree(). A 64bit div, according to [1], has a
> latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
> i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
> still slow enough there to be very worrisome.
>
> I don't see a way to get around the division while keeping the freelist
> structure as is. But:
>
> ISTM that we only need the index because of the free-chunk list, right? Why
> don't we make the chunk list use actual pointers? Is it a concern that that'd
> increase the minimum allocation size? If so, I see two ways around that:
> First, we could make the index just the offset from the start of the block,
> that's much cheaper to calculate. Second, we could store the next pointer in
> SlabChunk->slab/block instead (making it a union) - while on the freelist we
> don't need to dereference those, right?
>
> I suspect both would also make the block initialization a bit cheaper.
> That should also accelerate SlabBlockGetChunk(), which currently shows up as
> an imul, which isn't exactly fast either (and uses a lot of execution ports).
Hmm, I think you're right we could simply use the pointers, but I have
not tried that.
> 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
> shows up prominently in profiles. That's ~tripling the number of cachelines
> touched in the happy path, with unpredictable accesses to boot.
>
> Perhaps we could reduce the precision of slab->freelist indexing to amortize
> that cost? I.e. not have one slab->freelist entry for each nfree, but instead
> have an upper limit on the number of freelists?
Yeah. The purpose of organizing the freelists like this is to prioritize
the "more full" blocks when allocating new chunks, in the hope that the
"less full" blocks will end up empty and freed faster.
But this is naturally imprecise, strongly depends on the workload, of
course, and I bet for most cases a less precise approach would work
just as well.
I'm not sure how exactly the upper limit you propose would work, but
perhaps we could group the blocks into nfree ranges, say [0-15], [16-31]
and so on. So after the alloc/free we'd calculate the new freelist index
as (nfree/16) and only move the block if the index changed. This would
reduce the overhead to 1/16 and probably even more in practice.
Of course, we could also say we have e.g. 8 freelists and work the
ranges backwards from that; I guess that's what you propose.
> 4) Less of a performance issue, and more of a usability issue: The constant
> block size strikes me as problematic. Most users of an allocator will
> sometimes deal with a small amount of data, and sometimes with a
> large amount.
I doubt this is worth the effort, really. The constant block size makes
various places much simpler (both to code and reason about), so this
should not make a huge difference in performance. And IMHO the block
size is mostly an implementation detail, so I don't see that as a
usability issue.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
On 7/17/21 9:43 PM, Andres Freund wrote:
>> 1) If allocations are short-lived, slab.c can end up constantly
>> freeing/initializing blocks. Which requires fairly expensively iterating over
>> all potential chunks in the block and initializing them. Just to then free that
>> memory again after a small number of allocations. The extreme case of this is
>> when there are phases of alloc/free of a single allocation.
>>
>> I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
>> problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
>> only works if the churn is on the last remaining block, so it's not a great
>> approach. Perhaps just keeping the last allocated block around would work?
>
> +1
>
> I think it makes perfect sense to not free the blocks immediately, and keep
> one (or a small number) as a cache. I'm not sure why we decided not to have
> a "keeper" block, but I suspect memory consumption was my main concern at
> that point. But I never expected the cost to be this high.
I think one free block might be too low in some cases. It's pretty
common to have workloads where the number of allocations is "bursty",
and it's imo one case where one might justifiably want to use a slab
allocator... Perhaps a portion of a high watermark? Or a portion of the
in use blocks?
Hm. I wonder if we should just not populate the freelist eagerly, to
drive down the initialization cost. I.e. have a separate allocation path
for chunks that have never been allocated, by having a
SlabBlock->free_offset or such.
Sure, it adds a branch to the allocation happy path, but it also makes the
first allocation for a chunk cheaper, because there's no need to get the next
element from the freelist (adding a likely cache miss). And it should make the
allocation of a new block faster by a lot.
>> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
>> up ~50% of the cycles in SlabFree(). A 64bit div, according to [1], has a
>> latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
>> i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
>> still slow enough there to be very worrisome.
>>
>> I don't see a way to get around the division while keeping the freelist
>> structure as is. But:
>>
>> ISTM that we only need the index because of the free-chunk list, right? Why
>> don't we make the chunk list use actual pointers? Is it a concern that that'd
>> increase the minimum allocation size? If so, I see two ways around that:
>> First, we could make the index just the offset from the start of the block,
>> that's much cheaper to calculate. Second, we could store the next pointer in
>> SlabChunk->slab/block instead (making it a union) - while on the freelist we
>> don't need to dereference those, right?
>>
>> I suspect both would also make the block initialization a bit cheaper.
>> That should also accelerate SlabBlockGetChunk(), which currently shows up as
>> an imul, which isn't exactly fast either (and uses a lot of execution ports).
>
> Hmm, I think you're right we could simply use the pointers, but I have not
> tried that.
I quickly tried that, and it does seem to improve matters considerably. The
block initialization still shows up as expensive, but not as bad. The div and
imul are gone (except in an assertion build right now). The list manipulation
still is visible.
>> 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
>> shows up prominently in profiles. That's ~tripling the number of cachelines
>> touched in the happy path, with unpredictable accesses to boot.
>>
>> Perhaps we could reduce the precision of slab->freelist indexing to amortize
>> that cost? I.e. not have one slab->freelist entry for each nfree, but instead
>> have an upper limit on the number of freelists?
>
> Yeah. The purpose of organizing the freelists like this is to prioritize the
> "more full" blocks when allocating new chunks, in the hope that the "less
> full" blocks will end up empty and freed faster.
>
> But this is naturally imprecise, and strongly depends on the workload, of
> course, and I bet for most cases a less precise approach would work just as
> well.
>
> I'm not sure how exactly the upper limit you propose would work, but perhaps
> we could group the blocks into nfree ranges, say [0-15], [16-31] and so on.
> So after the alloc/free we'd calculate the new freelist index as (nfree/16)
> and only move the block if the index changed. This would reduce the
> overhead to 1/16 and probably even more in practice.
>
> Of course, we could also say we have e.g. 8 freelists and work the ranges
> backwards from that, I guess that's what you propose.
Yea, I was thinking something along those lines. As you say, how about there's
always at most 8 freelists or such. During initialization we compute a shift
that distributes chunksPerBlock from 0 to 8. Then we only need to perform list
manipulation if block->nfree >> slab->freelist_shift != --block->nfree >> slab->freelist_shift.
That seems nice from a memory usage POV as well - for small and frequent
allocations needing chunksPerBlock freelists isn't great. The freelists are on
a fair number of cachelines right now.
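A sketch of that shift scheme (helper names are hypothetical): the shift is
chosen once at context creation so that nfree >> shift always lands in a small
bucket range, and relinking is only needed when an alloc/free crosses a bucket
boundary.

```c
#include <assert.h>

#define SLAB_MAX_FREELISTS 8

/* Compute a right-shift so that (nfree >> shift) lands in [0, 8]:
 * blocks then only move between freelists when the bucket index
 * changes, i.e. at most once per chunksPerBlock/8 operations. */
static int
freelist_shift(int chunksPerBlock)
{
    int         shift = 0;

    while ((chunksPerBlock >> shift) > SLAB_MAX_FREELISTS)
        shift++;
    return shift;
}

/* True when a free (nfree -> nfree + 1) must relink the block. */
static int
needs_relink_on_free(int nfree, int shift)
{
    return (nfree >> shift) != ((nfree + 1) >> shift);
}
```

For chunksPerBlock = 64 this yields shift = 3, so the expensive unlink/relink
happens on at most one in eight frees instead of every one.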
>> 4) Less of a performance issue, and more of a usability issue: The constant
>> block size strikes me as problematic. Most users of an allocator will
>> sometimes deal with a small amount of data, and sometimes with a
>> large amount.
>
> I doubt this is worth the effort, really. The constant block size makes
> various places much simpler (both to code and reason about), so this should
> not make a huge difference in performance. And IMHO the block size is mostly
> an implementation detail, so I don't see that as a usability issue.
Hm? It's something the user has to specify, so it's not really an
implementation detail. It needs to be specified without sufficient
information, as well, since externally one doesn't know how much memory the
block header and chunk headers + rounding up will use, so computing a good
block size isn't easy. I've wondered whether it should just be a count...

Why do you not think it's relevant for performance? Either one causes too much
memory usage by using a too large block size, wasting memory, or one ends up
losing perf through frequent allocations?
Greetings,
Andres Freund
On 7/17/21 11:14 PM, Andres Freund wrote:
Hi,
On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
On 7/17/21 9:43 PM, Andres Freund wrote:
>>> 1) If allocations are short-lived, slab.c can end up constantly
>>> freeing/initializing blocks. Which requires fairly expensively iterating over
>>> all potential chunks in the block and initializing them. Just to then free that
>>> memory again after a small number of allocations. The extreme case of this is
>>> when there are phases of alloc/free of a single allocation.
>>>
>>> I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
>>> problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
>>> only works if the churn is on the last remaining block, so it's not a great
>>> approach. Perhaps just keeping the last allocated block around would work?
>>
>> +1
>>
>> I think it makes perfect sense to not free the blocks immediately, and keep
>> one (or a small number) as a cache. I'm not sure why we decided not to have
>> a "keeper" block, but I suspect memory consumption was my main concern at
>> that point. But I never expected the cost to be this high.
>
> I think one free block might be too low in some cases. It's pretty
> common to have workloads where the number of allocations is "bursty",
> and it's imo one case where one might justifiably want to use a slab
> allocator... Perhaps a portion of a high watermark? Or a portion of the
> in use blocks?
I think the portion of watermark would be problematic for cases with one
huge transaction - that'll set a high watermark, and we'll keep way too
many free blocks. But the portion of in use blocks might work, I think.
> Hm. I wonder if we should just not populate the freelist eagerly, to
> drive down the initialization cost. I.e. have a separate allocation path
> for chunks that have never been allocated, by having a
> SlabBlock->free_offset or such.
>
> Sure, it adds a branch to the allocation happy path, but it also makes the
> first allocation for a chunk cheaper, because there's no need to get the next
> element from the freelist (adding a likely cache miss). And it should make the
> allocation of a new block faster by a lot.
Not sure what you mean by 'not populate eagerly' so can't comment :-(
>>> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
>>> up ~50% of the cycles in SlabFree(). A 64bit div, according to [1], has a
>>> latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
>>> i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
>>> still slow enough there to be very worrisome.
>>>
>>> I don't see a way to get around the division while keeping the freelist
>>> structure as is. But:
>>>
>>> ISTM that we only need the index because of the free-chunk list, right? Why
>>> don't we make the chunk list use actual pointers? Is it a concern that that'd
>>> increase the minimum allocation size? If so, I see two ways around that:
>>> First, we could make the index just the offset from the start of the block,
>>> that's much cheaper to calculate. Second, we could store the next pointer in
>>> SlabChunk->slab/block instead (making it a union) - while on the freelist we
>>> don't need to dereference those, right?
>>>
>>> I suspect both would also make the block initialization a bit cheaper.
>>> That should also accelerate SlabBlockGetChunk(), which currently shows up as
>>> an imul, which isn't exactly fast either (and uses a lot of execution ports).
>>
>> Hmm, I think you're right we could simply use the pointers, but I have not
>> tried that.
>
> I quickly tried that, and it does seem to improve matters considerably. The
> block initialization still shows up as expensive, but not as bad. The div and
> imul are gone (except in an assertion build right now). The list manipulation
> still is visible.
Understood. I didn't expect this to be a full solution.
>>> 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
>>> shows up prominently in profiles. That's ~tripling the number of cachelines
>>> touched in the happy path, with unpredictable accesses to boot.
>>>
>>> Perhaps we could reduce the precision of slab->freelist indexing to amortize
>>> that cost? I.e. not have one slab->freelist entry for each nfree, but instead
>>> have an upper limit on the number of freelists?
>>
>> Yeah. The purpose of organizing the freelists like this is to prioritize the
>> "more full" blocks when allocating new chunks, in the hope that the "less
>> full" blocks will end up empty and freed faster.
>>
>> But this is naturally imprecise, and strongly depends on the workload, of
>> course, and I bet for most cases a less precise approach would work just as
>> well.
>>
>> I'm not sure how exactly the upper limit you propose would work, but perhaps
>> we could group the blocks into nfree ranges, say [0-15], [16-31] and so on.
>> So after the alloc/free we'd calculate the new freelist index as (nfree/16)
>> and only move the block if the index changed. This would reduce the
>> overhead to 1/16 and probably even more in practice.
>>
>> Of course, we could also say we have e.g. 8 freelists and work the ranges
>> backwards from that, I guess that's what you propose.
>
> Yea, I was thinking something along those lines. As you say, how about there's
> always at most 8 freelists or such. During initialization we compute a shift
> that distributes chunksPerBlock from 0 to 8. Then we only need to perform list
> manipulation if block->nfree >> slab->freelist_shift != --block->nfree >> slab->freelist_shift.
>
> That seems nice from a memory usage POV as well - for small and frequent
> allocations needing chunksPerBlock freelists isn't great. The freelists are on
> a fair number of cachelines right now.
Agreed.
>>> 4) Less of a performance issue, and more of a usability issue: The constant
>>> block size strikes me as problematic. Most users of an allocator will
>>> sometimes deal with a small amount of data, and sometimes with a
>>> large amount.
>>
>> I doubt this is worth the effort, really. The constant block size makes
>> various places much simpler (both to code and reason about), so this should
>> not make a huge difference in performance. And IMHO the block size is mostly
>> an implementation detail, so I don't see that as a usability issue.
>
> Hm? It's something the user has to specify, so it's not really an
> implementation detail. It needs to be specified without sufficient
> information, as well, since externally one doesn't know how much memory the
> block header and chunk headers + rounding up will use, so computing a good
> block size isn't easy. I've wondered whether it should just be a count...
I think this is mixing two problems - how to specify the block size, and
whether the block size is constant (as in slab) or grows over time (as
in allocset).
As for specifying the block size, I agree maybe setting chunk count and
deriving the bytes from that might be easier to use, for the reasons you
mentioned.
But growing the block size seems problematic for long-lived contexts
with workloads that change a lot over time - imagine e.g. decoding many
small transactions, with one huge transaction mixed in. The one huge
transaction will grow the block size, and we'll keep using it forever.
But in that case we might just as well have allocated the large blocks
from the start, I guess.
> Why do you not think it's relevant for performance? Either one causes too much
> memory usage by using a too large block size, wasting memory, or one ends up
> losing perf through frequent allocations?
True. I simply would not expect this to make a huge difference - I may
be wrong, and I'm sure there are workloads where it matters. But I still
think it's easier to just use larger blocks than to make the slab code
more complex.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2021-07-18 00:46:09 +0200, Tomas Vondra wrote:
On 7/17/21 11:14 PM, Andres Freund wrote:
>> Hm. I wonder if we should just not populate the freelist eagerly, to
>> drive down the initialization cost. I.e. have a separate allocation path
>> for chunks that have never been allocated, by having a
>> SlabBlock->free_offset or such.
>>
>> Sure, it adds a branch to the allocation happy path, but it also makes the
>> first allocation for a chunk cheaper, because there's no need to get the next
>> element from the freelist (adding a likely cache miss). And it should make the
>> allocation of a new block faster by a lot.
>
> Not sure what you mean by 'not populate eagerly' so can't comment :-(
Instead of populating a linked list with all chunks upon creation of a block -
which requires touching a fair bit of memory - keep a per-block pointer (or an
offset) into the "unused" area of the block. When allocating from the block and
there's still "unused" memory left, use that, instead of bothering with the
freelist.
I tried that, and it nearly got slab up to the allocation/freeing performance
of aset.c (while winning after allocation, due to the higher memory density).
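A minimal sketch of that lazy-initialization scheme (hypothetical, simplified
structures - not the actual patch): block creation no longer walks and links
every chunk; never-used chunks are handed out from a bump offset, and the
freelist only ever contains chunks that have actually been freed.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Block
{
    char       *mem;            /* chunk storage */
    size_t      chunk_size;     /* must be >= sizeof(void *) */
    size_t      nchunks;
    size_t      free_offset;    /* next never-used chunk index */
    void       *freehead;       /* chunks freed at least once */
} Block;

static void *
block_alloc(Block *b)
{
    /* Path 1: reuse an explicitly freed chunk. */
    if (b->freehead != NULL)
    {
        void       *p = b->freehead;

        b->freehead = *(void **) p;     /* next pointer stored in the chunk */
        return p;
    }
    /* Path 2: bump into the unused tail. Because of this path, block
     * creation never had to touch and link all chunks up front. */
    if (b->free_offset < b->nchunks)
        return b->mem + (b->free_offset++) * b->chunk_size;
    return NULL;                /* block full */
}

static void
block_free_chunk(Block *b, void *p)
{
    *(void **) p = b->freehead; /* push onto the freed-chunk list */
    b->freehead = p;
}

static Block *
block_create(size_t nchunks, size_t chunk_size)
{
    Block      *b = malloc(sizeof(Block));

    b->mem = malloc(nchunks * chunk_size);
    b->chunk_size = chunk_size;
    b->nchunks = nchunks;
    b->free_offset = 0;         /* O(1): nothing linked up front */
    b->freehead = NULL;
    return b;
}
```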
>>>> 4) Less of a performance issue, and more of a usability issue: The constant
>>>> block size strikes me as problematic. Most users of an allocator will
>>>> sometimes deal with a small amount of data, and sometimes with a
>>>> large amount.
>>>
>>> I doubt this is worth the effort, really. The constant block size makes
>>> various places much simpler (both to code and reason about), so this should
>>> not make a huge difference in performance. And IMHO the block size is mostly
>>> an implementation detail, so I don't see that as a usability issue.
>>
>> Hm? It's something the user has to specify, so it's not really an
>> implementation detail. It needs to be specified without sufficient
>> information, as well, since externally one doesn't know how much memory the
>> block header and chunk headers + rounding up will use, so computing a good
>> block size isn't easy. I've wondered whether it should just be a count...
>
> I think this is mixing two problems - how to specify the block size, and
> whether the block size is constant (as in slab) or grows over time (as in
> allocset).
That was in response to the "implementation detail" bit solely.
> But growing the block size seems problematic for long-lived contexts with
> workloads that change a lot over time - imagine e.g. decoding many small
> transactions, with one huge transaction mixed in. The one huge transaction
> will grow the block size, and we'll keep using it forever. But in that case
> we might just as well have allocated the large blocks from the start, I
> guess.
I was thinking of capping the growth fairly low. I don't think after a 16x
growth or so you're likely to still see allocation performance gains with
slab. And I don't think that'd be too bad for decoding - we'd start with a
small initial block size, which in many workloads will be enough, and only
workloads where that doesn't suffice will adapt, performance-wise. And: Medium
term I wouldn't expect reorderbuffer.c to stay the only slab.c user...
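The capped growth could look something like this sketch (the doubling policy
and the 16x cap are illustrative, matching the figure above, not a settled
design):

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_GROWTH_CAP 16     /* stop growing after 16x the initial size */

/* Size to use for the next block: double each time, up to the cap.
 * Small contexts stay small; contexts that allocate a lot quickly
 * reach a block size where per-block overhead is negligible. */
static size_t
next_block_size(size_t cur_size, size_t init_size)
{
    if (cur_size >= init_size * BLOCK_GROWTH_CAP)
        return cur_size;
    return cur_size * 2;
}
```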
>> Why do you not think it's relevant for performance? Either one causes too much
>> memory usage by using a too large block size, wasting memory, or one ends up
>> losing perf through frequent allocations?
>
> True. I simply would not expect this to make a huge difference - I may be
> wrong, and I'm sure there are workloads where it matters. But I still think
> it's easier to just use larger blocks than to make the slab code more
> complex.
IDK. I'm looking at using slab as part of a radix tree implementation right
now. Which I'd hope to be used in various different situations. So it's hard
to choose the right block size - and it does seem to matter for performance.
Greetings,
Andres Freund
Hi,
On 2021-07-17 16:10:19 -0700, Andres Freund wrote:
> Instead of populating a linked list with all chunks upon creation of a block -
> which requires touching a fair bit of memory - keep a per-block pointer (or an
> offset) into the "unused" area of the block. When allocating from the block and
> there's still "unused" memory left, use that, instead of bothering with the
> freelist.
>
> I tried that, and it nearly got slab up to the allocation/freeing performance
> of aset.c (while winning after allocation, due to the higher memory density).
Combining that with limiting the number of freelists, and some
microoptimizations, allocation performance is now on par.
Freeing still seems to be a tad slower, mostly because SlabFree()
practically is immediately stalled on fetching the block, whereas
AllocSetFree() can happily speculate ahead and do work like computing
the freelist index. And then aset only needs to access memory inside the
context - which is much more likely to be in cache than a freelist
inside a block (there are many more).
But that's ok, I think. It's close and it's only a small share of the
overall runtime of my workload...
Greetings,
Andres Freund
On 7/18/21 3:06 AM, Andres Freund wrote:
Hi,
On 2021-07-17 16:10:19 -0700, Andres Freund wrote:
>> Instead of populating a linked list with all chunks upon creation of a block -
>> which requires touching a fair bit of memory - keep a per-block pointer (or an
>> offset) into the "unused" area of the block. When allocating from the block and
>> there's still "unused" memory left, use that, instead of bothering with the
>> freelist.
>>
>> I tried that, and it nearly got slab up to the allocation/freeing performance
>> of aset.c (while winning after allocation, due to the higher memory density).
>
> Combining that with limiting the number of freelists, and some
> microoptimizations, allocation performance is now on par.
>
> Freeing still seems to be a tad slower, mostly because SlabFree()
> practically is immediately stalled on fetching the block, whereas
> AllocSetFree() can happily speculate ahead and do work like computing
> the freelist index. And then aset only needs to access memory inside the
> context - which is much more likely to be in cache than a freelist
> inside a block (there are many more).
>
> But that's ok, I think. It's close and it's only a small share of the
> overall runtime of my workload...
Sounds great! Thanks for investigating this and for the improvements.
It might be good to do some experiments to see how the changes affect
memory consumption for practical workloads. I'm willing to spend some
time on that, if needed.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
Sounds great! Thanks for investigating this and for the improvements.
It might be good to do some experiments to see how the changes affect memory
consumption for practical workloads. I'm willing to spend some time on that,
if needed.
I've attached my changes. They're in a rough shape right now, but I
think good enough for an initial look.
Greetings,
Andres Freund
Em seg., 19 de jul. de 2021 às 17:56, Andres Freund <andres@anarazel.de>
escreveu:
Hi,
On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
Sounds great! Thanks for investigating this and for the improvements.
It might be good to do some experiments to see how the changes affect
memory consumption for practical workloads. I'm willing to spend some
time on that, if needed.
I've attached my changes. They're in a rough shape right now, but I
think good enough for an initial look.
Hi Andres, I took a look.
Perhaps you would agree with me that in the vast majority of cases, malloc
will not fail.
So it makes more sense to test:
if (ret != NULL)
than
if (ret == NULL)
which might help branch prediction.
This change also makes it possible to reduce the scope of some variables.
Example:
+static void * pg_noinline
+AllocSetAllocLarge(AllocSet set, Size size, int flags)
+{
+ AllocBlock block;
+ Size chunk_size;
+ Size blksize;
+
+ /* check size, only allocation path where the limits could be hit */
+ MemoryContextCheckSize(&set->header, size, flags);
+
+ AssertArg(AllocSetIsValid(set));
+
+ chunk_size = MAXALIGN(size);
+ blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ block = (AllocBlock) malloc(blksize);
+ if (block != NULL)
+ {
+ AllocChunk chunk;
+
+ set->header.mem_allocated += blksize;
+
+ block->aset = set;
+ block->freeptr = block->endptr = ((char *) block) + blksize;
+
+ /*
+ * Stick the new block underneath the active allocation block, if any,
+ * so that we don't lose the use of the space remaining therein.
+ */
+ if (set->blocks != NULL)
+ {
+ block->prev = set->blocks;
+ block->next = set->blocks->next;
+ if (block->next)
+ block->next->prev = block;
+ set->blocks->next = block;
+ }
+ else
+ {
+ block->prev = NULL;
+ block->next = NULL;
+ set->blocks = block;
+ }
+
+ chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
+ chunk->size = chunk_size;
+
+ return AllocSetAllocReturnChunk(set, size, chunk, chunk_size);
+ }
+
+ return NULL;
+}
regards,
Ranier Vilela
On Tue, 20 Jul 2021 at 10:04, Ranier Vilela <ranier.vf@gmail.com> wrote:
Perhaps you would agree with me that in the vast majority of cases, malloc will not fail.
So it makes more sense to test:
if (ret != NULL)
than
if (ret == NULL)
I think it'd be better to use unlikely() for that.
David
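For reference, PostgreSQL's unlikely() macro expands to __builtin_expect on GCC-compatible compilers; the suggested pattern would look roughly like this (alloc_or_null is an illustrative stand-in, not code from the patch):

```c
#include <stdlib.h>

/* simplified version of PostgreSQL's unlikely() from c.h */
#if defined(__GNUC__)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define unlikely(x) (x)
#endif

static void *
alloc_or_null(size_t blksize)
{
	void   *block = malloc(blksize);

	/* keep the common (success) path straight-line; hint the failure */
	if (unlikely(block == NULL))
		return NULL;

	/* ... initialize the block on the hot path ... */
	return block;
}
```

This keeps the failure handling out of the hot path without inverting the test, so variable scopes need not change.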
Em ter., 20 de jul. de 2021 às 11:15, David Rowley <dgrowleyml@gmail.com>
escreveu:
On Tue, 20 Jul 2021 at 10:04, Ranier Vilela <ranier.vf@gmail.com> wrote:
Perhaps you would agree with me that in the vast majority of cases,
malloc will not fail.
So it makes more sense to test:
if (ret != NULL)
than
if (ret == NULL)
I think it'd be better to use unlikely() for that.
Sure, it can be, but in this case, there is no way to reduce the scope.
regards,
Ranier Vilela
Hi,
I spent a bit of time benchmarking this - the first patch adds an
extension with three functions, each executing a slightly different
allocation pattern:
1) FIFO (allocates and frees in the same order)
2) LIFO (frees in reverse order)
3) random
Each function can also do a custom number of iterations, each allocating
and freeing certain number of chunks. The bench.sql script executes
three combinations (for each pattern)
1) no loops
2) increase: 100 loops, each freeing 10k chunks and allocating 15k
3) decrease: 100 loops, each freeing 10k chunks and allocating 5k
The idea is to test simple one-time allocation, and workloads that mix
allocations and frees (to see how the changes affect reuse etc.).
The script tests this with a range of block sizes (1k-32k) and chunk
sizes (32B-512B).
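The FIFO and LIFO patterns differ only in the order of the free phase, roughly like this (illustrative code using plain malloc; the real extension allocates from the memory context being benchmarked):

```c
#include <assert.h>
#include <stdlib.h>

#define NCHUNKS 1000

/*
 * Allocate NCHUNKS chunks, then free them either in allocation order
 * (FIFO) or in reverse order (LIFO).  Returns the number of chunks freed.
 */
static int
run_pattern(int chunk_size, int lifo)
{
	void   *chunks[NCHUNKS];
	int		nfreed = 0;

	/* allocation phase: identical for every pattern */
	for (int i = 0; i < NCHUNKS; i++)
		chunks[i] = malloc(chunk_size);

	/* free phase: same order for FIFO, reverse order for LIFO */
	for (int i = 0; i < NCHUNKS; i++)
	{
		int		idx = lifo ? (NCHUNKS - 1 - i) : i;

		free(chunks[idx]);
		nfreed++;
	}

	return nfreed;
}
```

FIFO tends to stress chunk reuse across blocks (the oldest chunks live in the oldest blocks), while LIFO mostly reuses the most recently touched block.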
In the attached .ods file with results, the "comparison" sheets are the
interesting ones - the last couple columns compare the main metrics for
the two patches (labeled patch-1 and patch-2) to master.
Overall, the results look quite good - patch-1 is mostly on par with
master, with maybe 5% variability in both directions. That's expected,
considering the patch does not aim to improve performance.
The second patch brings some nice improvements - 30%-50% in most cases
(for both allocation and free) seems pretty nice. But for the "increase"
FIFO pattern (incrementally allocating/freeing more memory) there's a
significant regression - particularly for the allocation time. In some
cases (larger chunks, block size does not matter too much) it jumps from
25ms to almost 200ms.
This seems unfortunate - the allocation pattern (FIFO, allocating more
memory over time) seems pretty common, and the slowdown is significant.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-slab-bench.patch (text/x-patch, +480 -1)
0002-WIP-optimize-allocations-by-separating-hot-from-cold.patch (text/x-patch, +300 -362)
0003-WIP-slab-performance.patch (text/x-patch, +285 -150)
slab-results.ods (application/vnd.oasis.opendocument.spreadsheet)
Hi,
On 2021-08-01 19:59:18 +0200, Tomas Vondra wrote:
In the attached .ods file with results, the "comparison" sheets are the
interesting ones - the last couple columns compare the main metrics for the
two patches (labeled patch-1 and patch-2) to master.
I assume with patch-1/2 you mean the ones after the benchmark patch
itself?
Overall, the results look quite good - patch-1 is mostly on par with master,
with maybe 5% variability in both directions. That's expected, considering
the patch does not aim to improve performance.
Not for slab anyway...
The second patch brings some nice improvements - 30%-50% in most cases (for
both allocation and free) seems pretty nice. But for the "increase" FIFO
pattern (incrementally allocating/freeing more memory) there's a significant
regression - particularly for the allocation time. In some cases (larger
chunks, block size does not matter too much) it jumps from 25ms to almost
200ms.
I'm not surprised to see memory usage increase somewhat, but that
degree of time overhead does surprise me. ISTM there's something wrong.
It'd probably worth benchmarking the different improvements inside the
WIP: slab performance. patch. There's some that I'd expect to be all
around improvements, whereas others likely aren't quite that clear
cut. I assume you'd prefer that I split the patch up?
This seems unfortunate - the allocation pattern (FIFO, allocating more
memory over time) seems pretty common, and the slowdown is significant.
Did you analyze what causes the regressions?
Greetings,
Andres Freund
On 8/1/21 11:07 PM, Andres Freund wrote:
Hi,
On 2021-08-01 19:59:18 +0200, Tomas Vondra wrote:
In the attached .ods file with results, the "comparison" sheets are the
interesting ones - the last couple columns compare the main metrics for the
two patches (labeled patch-1 and patch-2) to master.
I assume with patch-1/2 you mean the ones after the benchmark patch
itself?
Yes, those are the two WIP patches you shared on 19/7.
Overall, the results look quite good - patch-1 is mostly on par with master,
with maybe 5% variability in both directions. That's expected, considering
the patch does not aim to improve performance.
Not for slab anyway...
Maybe the hot/cold separation could have some effect, but probably not
for the workloads I've tested.
The second patch brings some nice improvements - 30%-50% in most cases (for
both allocation and free) seems pretty nice. But for the "increase" FIFO
pattern (incrementally allocating/freeing more memory) there's a significant
regression - particularly for the allocation time. In some cases (larger
chunks, block size does not matter too much) it jumps from 25ms to almost
200ms.
I'm not surprised to see memory usage increase somewhat, but that
degree of time overhead does surprise me. ISTM there's something wrong.
Yeah, the higher amount of allocated memory is due to the couple fields
added to the SlabBlock struct, but even that only affects a single case
with 480B chunks and 1kB blocks. Seems fine to me, especially if we end
up growing the slab blocks.
Not sure about the allocation time, though.
It'd probably worth benchmarking the different improvements inside the
WIP: slab performance. patch. There's some that I'd expect to be all
around improvements, whereas others likely aren't quite that clear
cut. I assume you'd prefer that I split the patch up?
Yeah, if you split that patch into sensible parts, I'll benchmark those.
Also, we can add more interesting workloads if you have some ideas.
This seems unfortunate - the allocation pattern (FIFO, allocating more
memory over time) seems pretty common, and the slowdown is significant.
Did you analyze what causes the regressions?
No, not yet. I'll run the same set of benchmarks for the Generation
context, discussed in the other thread, and then I'll investigate this. But if
you split the patch, that'd probably help.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
FWIW I tried running the benchmarks again, with some minor changes in
the extension code - most importantly, the time is counted in microseconds
(instead of milliseconds).
I suspected the millisecond granularity might have been causing rounding
errors (essentially not counting anything below 1ms, because it rounds to
0), and the results are a bit different.
On the i5-2500k machine it's an improvement across the board, while on
the bigger Xeon e5-2620v3 machine it shows roughly the same regression
for the "decreasing" allocation pattern.
There's another issue in the benchmarking script - the queries are meant
to do multiple runs for each combination of parameters, but it's written
in a way that simply runs it once and then does cross product with the
generate_series(1,5). I'll look into fixing that, but judging by the
stability of results for similar chunk sizes it won't change much.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-slab-bench-v2.patch (text/x-patch, +496 -1)
0002-WIP-optimize-allocations-by-separating-hot-from-c-v2.patch (text/x-patch, +300 -362)
0003-WIP-slab-performance-v2.patch (text/x-patch, +285 -150)
slab i5.ods (application/vnd.oasis.opendocument.spreadsheet)
slab xeon.ods (application/vnd.oasis.opendocument.spreadsheet)
Hi,
I've been investigating the regressions in some of the benchmark
results, together with the generation context benchmarks [1].
Turns out it's pretty difficult to benchmark this, because the results
strongly depend on what the backend did before. For example if I run
slab_bench_fifo with the "decreasing" test for 32kB blocks and 512B
chunks, I get this:
select * from slab_bench_fifo(1000000, 32768, 512, 100, 10000, 5000);
mem_allocated | alloc_ms | free_ms
---------------+----------+---------
528547840 | 155394 | 87440
i.e. palloc() takes ~155ms and pfree() ~87ms (and these result are
stable, the numbers don't change much with more runs).
But if I run a set of "lifo" tests in the backend first, the results
look like this:
mem_allocated | alloc_ms | free_ms
---------------+----------+---------
528547840 | 41728 | 71524
(1 row)
so the pallocs are suddenly about ~4x faster. Clearly, what the backend
did before may have pretty dramatic impact on results, even for simple
benchmarks like this.
Note: The benchmark was a single SQL script, running all the different
workloads in the same backend.
I did a fair amount of perf profiling, and the main difference between
the slow and fast runs seems to be this:
0 page-faults:u
0 minor-faults:u
0 major-faults:u
vs
20,634,153 page-faults:u
20,634,153 minor-faults:u
0 major-faults:u
Attached is a more complete perf stat output, but the page faults seem
to be the main issue. My theory is that in the "fast" case, the past
backend activity puts the glibc memory management into a state that
prevents page faults in the benchmark.
But of course, this theory may be incomplete - for example it's not
clear why running the benchmark repeatedly would not "condition" the
backend the same way. But it doesn't - it's ~150ms even for repeated runs.
Secondly, I'm not sure this explains why some of the timings actually
got much slower with the 0003 patch, when the sequence of the steps is
still the same. Of course, it's possible 0003 changes the allocation
pattern a bit, interfering with glibc memory management.
This leads to a couple of interesting questions, I think:
1) I've only tested this on Linux, with glibc. I wonder how it'd behave
on other platforms, or with other allocators.
2) Which cases are more important? When the backend was warmed up, or
when each benchmark runs in a new backend? It seems the "new backend" is
something like a "worst case" leading to more page faults, so maybe
that's the thing to watch. OTOH it's unlikely to have a completely new
backend, so maybe not.
3) Can this teach us something about how to allocate stuff, to better
"prepare" the backend for future allocations? For example, it's a bit
strange that repeated runs of the same benchmark don't do the trick, for
some reason.
regards
[1]: /messages/by-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a@enterprisedb.com
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
perf.stat.txt (text/plain)
On Sat, 11 Sept 2021 at 09:07, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
I've been investigating the regressions in some of the benchmark
results, together with the generation context benchmarks [1].
I've not looked into the regression you found with this yet, but I did
rebase the patch. slab.c has seen quite a number of changes recently.
I didn't spend a lot of time checking over the patch. I mainly wanted
to see what the performance was like before reviewing in too much
detail.
To test the performance, I used [1] and ran:
select pg_allocate_memory_test(<nbytes>, 1024*1024,
10::bigint*1024*1024*1024, 'slab');
that basically allocates chunks of <nbytes> and keeps around 1MB of
them at a time and allocates a total of 10GBs of them.
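That workload can be sketched as a fixed-size ring of live chunks (illustrative code using plain malloc; the real test function allocates from the context type passed to it):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate total_bytes in chunk_size pieces, keeping at most keep_bytes
 * of them live at any time (the oldest chunk is freed first, FIFO style).
 * Returns the total number of chunks allocated.
 */
static long
churn(size_t chunk_size, size_t keep_bytes, size_t total_bytes)
{
	size_t	nkeep = keep_bytes / chunk_size;
	void  **ring = calloc(nkeep, sizeof(void *));
	long	nalloc = 0;

	for (size_t done = 0; done < total_bytes; done += chunk_size)
	{
		size_t	slot = (size_t) nalloc % nkeep;

		free(ring[slot]);		/* frees NULL until the ring fills up */
		ring[slot] = malloc(chunk_size);
		nalloc++;
	}

	/* release whatever is still live */
	for (size_t i = 0; i < nkeep; i++)
		free(ring[i]);
	free(ring);

	return nalloc;
}
```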
I saw:
Master:
16 byte chunk = 8754.678 ms
32 byte chunk = 4511.725 ms
64 byte chunk = 2244.885 ms
128 byte chunk = 1135.349 ms
256 byte chunk = 548.030 ms
512 byte chunk = 272.017 ms
1024 byte chunk = 144.618 ms
Master + attached patch:
16 byte chunk = 5255.974 ms
32 byte chunk = 2640.807 ms
64 byte chunk = 1328.949 ms
128 byte chunk = 668.078 ms
256 byte chunk = 330.564 ms
512 byte chunk = 166.844 ms
1024 byte chunk = 85.399 ms
So patched runs in about 60% of the time that master runs in.
I plan to look at the patch in a bit more detail and see if I can
recreate and figure out the regression that Tomas reported. For now, I
just want to share the rebased patch.
The only thing I really adjusted from Andres' version is to instead of
using pointers for the linked list block freelist, I made it store the
number of bytes into the block that the chunk is. This means we can
use 4 bytes instead of 8 bytes for these pointers. The block size is
limited to 1GB now anyway, so 32-bit is large enough for these
offsets.
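The space saving can be sketched like this (illustrative helpers, not the actual patch code): since a block is at most 1GB, any chunk's distance from the block start fits in 32 bits, halving the footprint of each freelist link.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert between a chunk pointer and its 32-bit offset within the block.
 * Valid for any block size up to 4GB, comfortably covering the 1GB limit.
 */
static inline uint32_t
chunk_to_offset(const char *block, const char *chunk)
{
	return (uint32_t) (chunk - block);
}

static inline char *
offset_to_chunk(char *block, uint32_t off)
{
	return block + off;
}
```

Besides the size reduction, offsets stay valid if the block were ever remapped, though the main win here is simply fitting more freelist links per cache line.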
David
[1]: /messages/by-id/attachment/137056/allocate_performance_functions.patch.txt
Attachments:
v3-0001-WIP-slab-performance.patchtext/plain; charset=US-ASCII; name=v3-0001-WIP-slab-performance.patchDownload+294-156
On Wed, Oct 12, 2022 at 4:37 PM David Rowley <dgrowleyml@gmail.com> wrote:
[v3]
+ /*
+ * Compute a shift that guarantees that shifting chunksPerBlock by it
+ * yields a value smaller than SLAB_FREELIST_COUNT - 1 (one freelist is
+ * used for full blocks).
+ */
+ slab->freelist_shift = 0;
+ while ((slab->chunksPerBlock >> slab->freelist_shift) >=
+ (SLAB_FREELIST_COUNT - 1))
+ slab->freelist_shift++;
+ /*
+ * Ensure, without a branch, that index 0 is only used for blocks entirely
+ * without free chunks.
+ * XXX: There probably is a cheaper way to do this. Needing to shift twice
+ * by slab->freelist_shift isn't great.
+ */
+ index = (freecount + (1 << slab->freelist_shift) - 1) >>
+ slab->freelist_shift;
How about something like
#define SLAB_FREELIST_COUNT ((1<<3) + 1)
index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
and dispense with both freelist_shift and the loop that computes it?
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, 11 Nov 2022 at 22:20, John Naylor <john.naylor@enterprisedb.com> wrote:
On Wed, Oct 12, 2022 at 4:37 PM David Rowley <dgrowleyml@gmail.com> wrote:
[v3]
+ /*
+ * Compute a shift that guarantees that shifting chunksPerBlock by it
+ * yields a value smaller than SLAB_FREELIST_COUNT - 1 (one freelist is
+ * used for full blocks).
+ */
+ slab->freelist_shift = 0;
+ while ((slab->chunksPerBlock >> slab->freelist_shift) >=
+ (SLAB_FREELIST_COUNT - 1))
+ slab->freelist_shift++;
+ /*
+ * Ensure, without a branch, that index 0 is only used for blocks entirely
+ * without free chunks.
+ * XXX: There probably is a cheaper way to do this. Needing to shift twice
+ * by slab->freelist_shift isn't great.
+ */
+ index = (freecount + (1 << slab->freelist_shift) - 1) >>
+ slab->freelist_shift;
How about something like
#define SLAB_FREELIST_COUNT ((1<<3) + 1)
index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
Doesn't this create a sort of round-robin use of the free list? What
we want is a sort of "histogram" bucket set of free lists so we can
group together blocks that have a close-enough free number of chunks.
Unless I'm mistaken, I think what you have doesn't do that.
I wondered if simply:
index = -(-freecount >> slab->freelist_shift);
would be faster than Andres' version. I tried it out and on my AMD
machine, it's about the same speed. Same on a Raspberry Pi 4.
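Both expressions compute the ceiling division ceil(freecount / 2^shift), which maps a freecount of 0 to index 0 and any nonzero count to a bucket of at least 1. A quick sketch checking that they agree (the second form relies on arithmetic right shift of negative signed values, which GCC and Clang provide but the C standard leaves implementation-defined):

```c
#include <assert.h>

/* the round-up-then-shift form from the patch */
static int
index_roundup(int freecount, int shift)
{
	return (freecount + (1 << shift) - 1) >> shift;
}

/* ceiling division via arithmetic shift of the negated count */
static int
index_negate(int freecount, int shift)
{
	return -(-freecount >> shift);
}
```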
Going by [2] (https://godbolt.org/z/dh95TohEG), the instructions are very different with each method, so
other machines with different latencies on those instructions might
show something different. I attached what I used to test if anyone
else wants a go.
AMD Zen2
$ ./freecount 2000000000
Test 'a' in 0.922766 seconds
Test 'd' in 0.922762 seconds (0.000433% faster)
RPI4
$ ./freecount 2000000000
Test 'a' in 3.341350 seconds
Test 'd' in 3.338690 seconds (0.079672% faster)
That was gcc. Trying it with clang, it went in a little heavy-handed
and optimized out my loop, so some more trickery might be needed for a
useful test on that compiler.
David