PG15 beta1 sort performance regression due to Generation context change
Hackers,
Over the past few days I've been gathering some benchmark results
together to show the sort performance improvements in PG15 [1].
[1] https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/speeding-up-sort-performance-in-postgres-15/ba-p/3396953#change4
One of the test cases I did was to demonstrate Heikki's change to use
a k-way merge (65014000b).
The test I did to try this out was along the lines of:
set max_parallel_workers_per_gather = 0;
create table t (a bigint not null, b bigint not null, c bigint not
null, d bigint not null, e bigint not null, f bigint not null);
insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
vacuum freeze t;
The query I ran was:
select * from t order by a offset 140247142;
I tested various sizes of work_mem starting at 4MB and doubled that
all the way to 16GB. For many of the smaller values of work_mem the
performance is vastly improved by Heikki's change, however for
work_mem = 64MB I detected quite a large slowdown. PG14 took 20.9
seconds and PG15 beta 1 took 29 seconds!
I've been trying to get to the bottom of this today and finally have
discovered this is due to the tuple size allocations in the sort being
exactly 64 bytes. Prior to 40af10b57 (Use Generation memory contexts
to store tuples in sorts) the tuple for the sort would be stored in an
aset context. After 40af10b57 we'll use a generation context. The
idea with that change is that the generation context does no
power-of-2 round ups for allocations, so we save memory in most cases.
However, due to this particular test having a tuple size of 64-bytes,
there was no power-of-2 wastage with aset.
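To make that rounding concrete, here is a small sketch of power-of-2 round-up as aset-style allocators do it (the 8-byte minimum is an illustrative assumption, not aset.c's exact behavior):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative power-of-2 round-up: a 64-byte request is already a
 * power of 2, so nothing is wasted, while a 65-byte request would
 * consume 128 bytes.  The 8-byte minimum is assumed for illustration. */
static size_t
pow2_round(size_t request)
{
    size_t chunk = 8;

    while (chunk < request)
        chunk <<= 1;
    return chunk;
}
```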
The problem is that generation chunks have a larger chunk header than
aset do due to having to store the block pointer that the chunk
belongs to so that GenerationFree() can increment the nfree chunks in
the block. aset.c does not require this as freed chunks just go onto a
freelist that's global to the entire context.
Basically, for my test query, the slowdown is because instead of being
able to store 620702 tuples per tape over 226 tapes with an aset
context, we can now only store 576845 tuples per tape resulting in
requiring 244 tapes when using the generation context.
If I had added column "g" to make the tuple size 72 bytes causing
aset's code to round allocations up to 128 bytes and generation.c to
maintain the 72 bytes then the sort would have stored 385805 tuples
over 364 batches for aset and 538761 tuples over 261 batches using the
generation context. That would have been a huge win.
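The arithmetic above can be sketched with illustrative header sizes (the 16-byte aset and 24-byte generation chunk headers below are assumptions for a 64-bit build, not authoritative values from aset.c or generation.c):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative per-tuple footprints under the two allocators.  Header
 * sizes are assumptions for a 64-bit build, not authoritative values. */
#define ASET_CHUNK_HDR 16       /* assumed aset chunk header */
#define GEN_CHUNK_HDR  24       /* assumed generation header: one extra pointer */

static size_t
aset_footprint(size_t request)
{
    size_t rounded = 8;

    while (rounded < request)   /* aset rounds requests up to a power of 2 */
        rounded <<= 1;
    return ASET_CHUNK_HDR + rounded;
}

static size_t
gen_footprint(size_t request)
{
    return GEN_CHUNK_HDR + request; /* generation stores the exact size */
}
```

Under these assumptions a 64-byte tuple costs 80 bytes in an aset context but 88 in a generation context, so fewer tuples fit per tape; at 72 bytes the aset cost jumps to 144 while generation needs only 96, which is the winning case.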
So it basically looks like I discovered a very bad case that causes a
significant slowdown. Yet other cases that are not an exact power of
2 stand to gain significantly from this change.
One thing 40af10b57 does is stop those terrible performance jumps
when the tuple size crosses a power-of-2 boundary. The performance
should be more aligned to the size of the data being sorted now...
Unfortunately, that seems to mean regressions for large sorts with
power-of-2 sized tuples.
I'm unsure exactly what I should do about this right now.
David
On 20/05/2022 08:56, David Rowley wrote:
The problem is that generation chunks have a larger chunk header than
aset do due to having to store the block pointer that the chunk
belongs to so that GenerationFree() can increment the nfree chunks in
the block. aset.c does not require this as freed chunks just go onto a
freelist that's global to the entire context.
Could the 'context' field be moved from GenerationChunk to GenerationBlock?
- Heikki
Hi David,
Over the past few days I've been gathering some benchmark results
together to show the sort performance improvements in PG15 [1].
One of the test cases I did was to demonstrate Heikki's change to use
a k-way merge (65014000b).
The test I did to try this out was along the lines of:
set max_parallel_workers_per_gather = 0;
create table t (a bigint not null, b bigint not null, c bigint not
null, d bigint not null, e bigint not null, f bigint not null);
insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
vacuum freeze t;
The query I ran was:
select * from t order by a offset 140247142;
I redid this test here:
Windows 10 64 bits
msvc 2019 64 bits
RAM 8GB
SSD 256 GB
HEAD (default configuration)
Time: 229396,551 ms (03:49,397)
PATCHED:
Time: 220887,346 ms (03:40,887)
I tested various sizes of work_mem starting at 4MB and doubled that
all the way to 16GB. For many of the smaller values of work_mem the
performance is vastly improved by Heikki's change, however for
work_mem = 64MB I detected quite a large slowdown. PG14 took 20.9
seconds and PG15 beta 1 took 29 seconds!
I've been trying to get to the bottom of this today and finally have
discovered this is due to the tuple size allocations in the sort being
exactly 64 bytes. Prior to 40af10b57 (Use Generation memory contexts
to store tuples in sorts) the tuple for the sort would be stored in an
aset context. After 40af10b57 we'll use a generation context. The
idea with that change is that the generation context does no
power-of-2 round ups for allocations, so we save memory in most cases.
However, due to this particular test having a tuple size of 64-bytes,
there was no power-of-2 wastage with aset.
The problem is that generation chunks have a larger chunk header than
aset do due to having to store the block pointer that the chunk
belongs to so that GenerationFree() can increment the nfree chunks in
the block. aset.c does not require this as freed chunks just go onto a
freelist that's global to the entire context.
Basically, for my test query, the slowdown is because instead of being
able to store 620702 tuples per tape over 226 tapes with an aset
context, we can now only store 576845 tuples per tape resulting in
requiring 244 tapes when using the generation context.
If I had added column "g" to make the tuple size 72 bytes causing
aset's code to round allocations up to 128 bytes and generation.c to
maintain the 72 bytes then the sort would have stored 385805 tuples
over 364 batches for aset and 538761 tuples over 261 batches using the
generation context. That would have been a huge win.
So it basically looks like I discovered a very bad case that causes a
significant slowdown. Yet other cases that are not an exact power of
2 stand to gain significantly from this change.
One thing 40af10b57 does is stops those terrible performance jumps
when the tuple size crosses a power-of-2 boundary. The performance
should be more aligned to the size of the data being sorted now...
Unfortunately, that seems to mean regressions for large sorts with
power-of-2 sized tuples.
It seems to me that the solution would be to use aset allocations
when the size of the tuples is power-of-2?
if (state->sortopt & TUPLESORT_ALLOWBOUNDED ||
    (state->memtupsize & (state->memtupsize - 1)) == 0)
    state->tuplecontext = AllocSetContextCreate(state->sortcontext,
                                                "Caller tuples",
                                                ALLOCSET_DEFAULT_SIZES);
else
    state->tuplecontext = GenerationContextCreate(state->sortcontext,
                                                  "Caller tuples",
                                                  ALLOCSET_DEFAULT_SIZES);
I took a look and tried some improvements to see if I had a better result.
Would you mind taking a look and testing?
regards,
Ranier Vilela
On 5/20/22 12:01, Heikki Linnakangas wrote:
On 20/05/2022 08:56, David Rowley wrote:
The problem is that generation chunks have a larger chunk header than
aset do due to having to store the block pointer that the chunk
belongs to so that GenerationFree() can increment the nfree chunks in
the block. aset.c does not require this as freed chunks just go onto a
freelist that's global to the entire context.
Could the 'context' field be moved from GenerationChunk to GenerationBlock?
Not easily, because GetMemoryChunkContext() expects the context to be
stored right before the chunk. In principle we could add "get context"
callback to MemoryContextMethods, so that different implementations can
override that.
I wonder how expensive the extra redirect would be, but probably not
much because the places touching chunk->context deal with the block too
(e.g. GenerationFree has to tweak block->nfree).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
On 5/20/22 12:01, Heikki Linnakangas wrote:
Could the 'context' field be moved from GenerationChunk to GenerationBlock?
Not easily, because GetMemoryChunkContext() expects the context to be
stored right before the chunk. In principle we could add "get context"
callback to MemoryContextMethods, so that different implementations can
override that.
How would you know which context type to consult for that?
regards, tom lane
On Tue, 24 May 2022 at 08:32, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 5/20/22 12:01, Heikki Linnakangas wrote:
Could the 'context' field be moved from GenerationChunk to GenerationBlock?
Not easily, because GetMemoryChunkContext() expects the context to be
stored right before the chunk. In principle we could add "get context"
callback to MemoryContextMethods, so that different implementations can
override that.
hmm, but we need to know the context first before we can know which
callback to call.
David
On 5/23/22 22:47, Tom Lane wrote:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
On 5/20/22 12:01, Heikki Linnakangas wrote:
Could the 'context' field be moved from GenerationChunk to GenerationBlock?
Not easily, because GetMemoryChunkContext() expects the context to be
stored right before the chunk. In principle we could add "get context"
callback to MemoryContextMethods, so that different implementations can
override that.
How would you know which context type to consult for that?
D'oh! I knew there has to be some flaw in that idea, but I forgot about
this chicken-or-egg issue.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
FYI not sure why, but your responses seem to break threading quite
often, due to missing headers identifying the message you're responding
to (In-Reply-To, References). Not sure why or how to fix it, but this
makes it much harder to follow the discussion.
On 5/22/22 21:11, Ranier Vilela wrote:
Hi David,
Over the past few days I've been gathering some benchmark results
together to show the sort performance improvements in PG15 [1].
One of the test cases I did was to demonstrate Heikki's change to use
a k-way merge (65014000b).
The test I did to try this out was along the lines of:
set max_parallel_workers_per_gather = 0;
create table t (a bigint not null, b bigint not null, c bigint not
null, d bigint not null, e bigint not null, f bigint not null);
insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
vacuum freeze t;
The query I ran was:
select * from t order by a offset 140247142;
I redid this test here:
Windows 10 64 bits
msvc 2019 64 bits
RAM 8GB
SSD 256 GB
HEAD (default configuration)
Time: 229396,551 ms (03:49,397)
PATCHED:
Time: 220887,346 ms (03:40,887)
This is 10x longer than reported by David. Presumably David used a
machine with a lot of RAM, while your system has 8GB and so is I/O-bound.
Also, what exactly does "patched" mean? The patch you attached?
I tested various sizes of work_mem starting at 4MB and doubled that
all the way to 16GB. For many of the smaller values of work_mem the
performance is vastly improved by Heikki's change, however for
work_mem = 64MB I detected quite a large slowdown. PG14 took 20.9
seconds and PG15 beta 1 took 29 seconds!
I've been trying to get to the bottom of this today and finally have
discovered this is due to the tuple size allocations in the sort being
exactly 64 bytes. Prior to 40af10b57 (Use Generation memory contexts
to store tuples in sorts) the tuple for the sort would be stored in an
aset context. After 40af10b57 we'll use a generation context. The
idea with that change is that the generation context does no
power-of-2 round ups for allocations, so we save memory in most cases.
However, due to this particular test having a tuple size of 64-bytes,
there was no power-of-2 wastage with aset.
The problem is that generation chunks have a larger chunk header than
aset do due to having to store the block pointer that the chunk
belongs to so that GenerationFree() can increment the nfree chunks in
the block. aset.c does not require this as freed chunks just go onto a
freelist that's global to the entire context.
Basically, for my test query, the slowdown is because instead of being
able to store 620702 tuples per tape over 226 tapes with an aset
context, we can now only store 576845 tuples per tape resulting in
requiring 244 tapes when using the generation context.
If I had added column "g" to make the tuple size 72 bytes causing
aset's code to round allocations up to 128 bytes and generation.c to
maintain the 72 bytes then the sort would have stored 385805 tuples
over 364 batches for aset and 538761 tuples over 261 batches using the
generation context. That would have been a huge win.
So it basically looks like I discovered a very bad case that causes a
significant slowdown. Yet other cases that are not an exact power of
2 stand to gain significantly from this change.
One thing 40af10b57 does is stop those terrible performance jumps
when the tuple size crosses a power-of-2 boundary. The performance
should be more aligned to the size of the data being sorted now...
Unfortunately, that seems to mean regressions for large sorts with
power-of-2 sized tuples.
It seems to me that the solution would be to use aset allocations
when the size of the tuples is power-of-2?
if (state->sortopt & TUPLESORT_ALLOWBOUNDED ||
    (state->memtupsize & (state->memtupsize - 1)) == 0)
    state->tuplecontext = AllocSetContextCreate(state->sortcontext,
                                                "Caller tuples",
                                                ALLOCSET_DEFAULT_SIZES);
else
    state->tuplecontext = GenerationContextCreate(state->sortcontext,
                                                  "Caller tuples",
                                                  ALLOCSET_DEFAULT_SIZES);
I'm pretty sure this is pointless, because memtupsize is the size of the
memtuples array. But the issue is about size of the tuples. After all,
David was talking about 64B chunks, but the array is always at least
1024 elements, so it obviously can't be the same thing.
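For reference, the bit trick in that snippet does test for a power of two; a minimal standalone version looks like this:

```c
#include <assert.h>
#include <stdbool.h>

/* The (x & (x - 1)) == 0 trick from the snippet above: clearing the
 * lowest set bit leaves zero only for powers of two (and for zero
 * itself, which must be treated as a special case). */
static bool
is_power_of_two(unsigned long x)
{
    return x != 0 && (x & (x - 1)) == 0;
}
```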
How would we even know how large the tuples will be at this point,
before we even see the first of them?
I took a look and tried some improvements to see if I had a better result.
IMHO special-casing should be the last resort, because it makes the
behavior much harder to follow. Also, we're talking about sort, but
don't other places using Generation context have the same issue?
It'd be preferable to find a fix addressing all those places.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 24 May 2022 at 08:52, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 5/23/22 22:47, Tom Lane wrote:
How would you know which context type to consult for that?
D'oh! I knew there has to be some flaw in that idea, but I forgot about
this chicken-or-egg issue.
Hand-wavy idea: It's probably too complex for now, and it also might
be too much overhead, but having GenerationPointerGetChunk() do a
binary search on a sorted-by-memory-address array of block pointers
might be a fast enough way to find the block that the pointer belongs
to. There should be far fewer blocks now since generation.c grows
the block sizes, so the O(log2 N) search complexity should never be
excessively high.
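A minimal sketch of that lookup, with hypothetical names rather than the actual generation.c structures:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: find the block containing a pointer by binary
 * search over an array of blocks kept sorted by start address. */
typedef struct BlockSpan
{
    char   *start;
    size_t  len;
} BlockSpan;

static BlockSpan *
block_for_ptr(BlockSpan *blocks, int nblocks, void *ptr)
{
    int     lo = 0,
            hi = nblocks - 1;

    while (lo <= hi)
    {
        int         mid = (lo + hi) / 2;
        BlockSpan  *b = &blocks[mid];

        if ((char *) ptr < b->start)
            hi = mid - 1;
        else if ((char *) ptr >= b->start + b->len)
            lo = mid + 1;
        else
            return b;           /* ptr falls inside this block */
    }
    return NULL;                /* ptr not in any block */
}
```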
However, it would mean a binary search for every pfree, which feels
pretty horrible. My feeling is that it seems unlikely that saving 8
bytes by not storing the GenerationBlock would be a net win here.
There may be something to claw back as for the tuplesort.c case as I
think the heap_free_minimal_tuple() call in writetup_heap() is not
needed in many cases as we reset the tuple context directly afterward
writing the tuples out.
David
David Rowley <dgrowleyml@gmail.com> writes:
Hand-wavy idea: It's probably too complex for now, and it also might
be too much overhead, but having GenerationPointerGetChunk() do a
binary search on a sorted-by-memory-address array of block pointers
might be a fast enough way to find the block that the pointer belongs
to. There should be far fewer blocks now since generation.c grows
the block sizes, so the O(log2 N) search complexity should never be
excessively high.
However, it would mean a binary search for every pfree, which feels
pretty horrible. My feeling is that it seems unlikely that saving 8
bytes by not storing the GenerationBlock would be a net win here.
I think probably that could be made to work, since a Generation
context should not contain all that many live blocks at any one time.
However, here's a different idea: how badly do we need the "size"
field in GenerationChunk? We're not going to ever recycle the
chunk, IIUC, so it doesn't matter exactly how big it is. When
doing MEMORY_CONTEXT_CHECKING we'll still want requested_size,
but that's not relevant to performance-critical cases.
regards, tom lane
On Tue, 24 May 2022 at 09:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
However, here's a different idea: how badly do we need the "size"
field in GenerationChunk? We're not going to ever recycle the
chunk, IIUC, so it doesn't matter exactly how big it is. When
doing MEMORY_CONTEXT_CHECKING we'll still want requested_size,
but that's not relevant to performance-critical cases.
Interesting idea. However, I do see a couple of usages of the "size"
field away from MEMORY_CONTEXT_CHECKING builds:
GenerationRealloc: uses "size" to figure out if the new size is
smaller than the old size. Maybe we could just always move to a new
chunk regardless of whether the new size is smaller or larger than the
old size.
GenerationGetChunkSpace: Uses "size" to figure out the amount of
memory used by the chunk. I'm not sure how we'd work around the
fact that USEMEM() uses that extensively in tuplesort.c to figure out
how much memory we're using.
David
On Mon, 23 May 2022 at 18:01, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
FYI not sure why, but your responses seem to break threading quite
often, due to missing headers identifying the message you're responding
to (In-Reply-To, References). Not sure why or how to fix it, but this
makes it much harder to follow the discussion.
On 5/22/22 21:11, Ranier Vilela wrote:
Hi David,
Over the past few days I've been gathering some benchmark results
together to show the sort performance improvements in PG15 [1].
One of the test cases I did was to demonstrate Heikki's change to use
a k-way merge (65014000b).
The test I did to try this out was along the lines of:
set max_parallel_workers_per_gather = 0;
create table t (a bigint not null, b bigint not null, c bigint not
null, d bigint not null, e bigint not null, f bigint not null);
insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
vacuum freeze t;
The query I ran was:
select * from t order by a offset 140247142;
I redid this test here:
Windows 10 64 bits
msvc 2019 64 bits
RAM 8GB
SSD 256 GB
HEAD (default configuration)
Time: 229396,551 ms (03:49,397)
PATCHED:
Time: 220887,346 ms (03:40,887)
This is 10x longer than reported by David. Presumably David used a
machine with a lot of RAM, while your system has 8GB and so is I/O-bound.
Probably, but Windows is slower than Linux, certainly.
Also, what exactly does "patched" mean? The patch you attached?
It means the results of the benchmark with the patch applied.
I tested various sizes of work_mem starting at 4MB and doubled that
all the way to 16GB. For many of the smaller values of work_mem the
performance is vastly improved by Heikki's change, however for
work_mem = 64MB I detected quite a large slowdown. PG14 took 20.9
seconds and PG15 beta 1 took 29 seconds!
I've been trying to get to the bottom of this today and finally have
discovered this is due to the tuple size allocations in the sort being
exactly 64 bytes. Prior to 40af10b57 (Use Generation memory contexts
to store tuples in sorts) the tuple for the sort would be stored in an
aset context. After 40af10b57 we'll use a generation context. The
idea with that change is that the generation context does no
power-of-2 round ups for allocations, so we save memory in most cases.
However, due to this particular test having a tuple size of 64-bytes,
there was no power-of-2 wastage with aset.
The problem is that generation chunks have a larger chunk header than
aset do due to having to store the block pointer that the chunk
belongs to so that GenerationFree() can increment the nfree chunks in
the block. aset.c does not require this as freed chunks just go onto a
freelist that's global to the entire context.
Basically, for my test query, the slowdown is because instead of being
able to store 620702 tuples per tape over 226 tapes with an aset
context, we can now only store 576845 tuples per tape resulting in
requiring 244 tapes when using the generation context.
If I had added column "g" to make the tuple size 72 bytes causing
aset's code to round allocations up to 128 bytes and generation.c to
maintain the 72 bytes then the sort would have stored 385805 tuples
over 364 batches for aset and 538761 tuples over 261 batches using the
generation context. That would have been a huge win.
So it basically looks like I discovered a very bad case that causes a
significant slowdown. Yet other cases that are not an exact power of
2 stand to gain significantly from this change.
One thing 40af10b57 does is stop those terrible performance jumps
when the tuple size crosses a power-of-2 boundary. The performance
should be more aligned to the size of the data being sorted now...
Unfortunately, that seems to mean regressions for large sorts with
power-of-2 sized tuples.
It seems to me that the solution would be to use aset allocations
when the size of the tuples is power-of-2?
if (state->sortopt & TUPLESORT_ALLOWBOUNDED ||
    (state->memtupsize & (state->memtupsize - 1)) == 0)
    state->tuplecontext = AllocSetContextCreate(state->sortcontext,
                                                "Caller tuples",
                                                ALLOCSET_DEFAULT_SIZES);
else
    state->tuplecontext = GenerationContextCreate(state->sortcontext,
                                                  "Caller tuples",
                                                  ALLOCSET_DEFAULT_SIZES);
I'm pretty sure this is pointless, because memtupsize is the size of the
memtuples array. But the issue is about size of the tuples. After all,
David was talking about 64B chunks, but the array is always at least
1024 elements, so it obviously can't be the same thing.
It was more of a guessing attempt.
How would we even know how large the tuples will be at this point,
before we even see the first of them?
I don't know how.
I took a look and tried some improvements to see if I had a better
result.
IMHO special-casing should be the last resort, because it makes the
behavior much harder to follow.
Probably, but in my tests, there has been some gain.
Also, we're talking about sort, but
don't other places using Generation context have the same issue?
Well, for PG15 that is what was addressed, just sort.
It'd be preferable to find a fix addressing all those places.
For now, I'm just focused on sorts.
regards,
Ranier Vilela
I wrote:
However, here's a different idea: how badly do we need the "size"
field in GenerationChunk? We're not going to ever recycle the
chunk, IIUC, so it doesn't matter exactly how big it is. When
doing MEMORY_CONTEXT_CHECKING we'll still want requested_size,
but that's not relevant to performance-critical cases.
Refining that a bit: we could provide the size field only when
MEMORY_CONTEXT_CHECKING and/or CLOBBER_FREED_MEMORY are defined.
That would leave us with GenerationRealloc and GenerationGetChunkSpace
not being supportable operations, but I wonder how much we need either.
BTW, shouldn't GenerationCheck be ifdef'd out if MEMORY_CONTEXT_CHECKING
isn't set? aset.c does things that way.
regards, tom lane
David Rowley <dgrowleyml@gmail.com> writes:
GenerationRealloc: uses "size" to figure out if the new size is
smaller than the old size. Maybe we could just always move to a new
chunk regardless of if the new size is smaller or larger than the old
size.
I had the same idea ... but we need to know the old size to know how much
to copy.
GenerationGetChunkSpace: Uses "size" to figure out the amount of
memory is used by the chunk. I'm not sure how we'd work around the
fact that USEMEM() uses that extensively in tuplesort.c to figure out
how much memory we're using.
Ugh, that seems like a killer.
regards, tom lane
On Tue, 24 May 2022 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, shouldn't GenerationCheck be ifdef'd out if MEMORY_CONTEXT_CHECKING
isn't set? aset.c does things that way.
Isn't it done in generation.c:954?
David
David Rowley <dgrowleyml@gmail.com> writes:
On Tue, 24 May 2022 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, shouldn't GenerationCheck be ifdef'd out if MEMORY_CONTEXT_CHECKING
isn't set? aset.c does things that way.
Isn't it done in generation.c:954?
Ah, sorry, didn't look that far up ...
regards, tom lane
I had another, possibly-crazy idea. I think that the API requirement
that the word before a chunk's start point to a MemoryContext is
overly strong. What we need is that it point to something in which
a MemoryContextMethods pointer can be found (at a predefined offset).
Thus, if generation.c is willing to add the overhead of a
MemoryContextMethods pointer in GenerationBlock, it could dispense
with the per-chunk context field and just have the GenerationBlock
link there. I guess most likely we'd also need a back-link to
the GenerationContext from GenerationBlock. Still, two more
pointers per GenerationBlock is an easy tradeoff to make to save
one pointer per GenerationChunk.
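A toy model of that layout (all names hypothetical, not the actual PostgreSQL structures):

```c
#include <assert.h>

/* Sketch of the idea above: the word before a chunk points at its
 * block, and the block carries the methods pointer at a fixed offset,
 * just as a MemoryContext would.  All names are hypothetical. */
typedef struct DemoMethods
{
    const char *name;           /* stand-in for the function-pointer table */
} DemoMethods;

typedef struct DemoBlock
{
    const DemoMethods *methods; /* at the same offset as in a context */
} DemoBlock;

typedef struct DemoChunkHdr
{
    DemoBlock *block;           /* back-link replaces the context field */
} DemoChunkHdr;

static const DemoMethods *
demo_chunk_methods(void *chunk_data)
{
    /* the header sits immediately before the chunk's data */
    DemoChunkHdr *hdr = ((DemoChunkHdr *) chunk_data) - 1;

    return hdr->block->methods;
}
```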
I've not trawled the code to make sure that *only* the methods
pointer is touched by context-type-independent code, but it
seems like we could get to that even if we're not there today.
Whether this idea is something we could implement post-beta,
I'm not sure. I'm inclined to think that we can't change the layout
of MemoryContextData post-beta, as people may already be building
extensions for production use. We could bloat GenerationBlock even
more so that the methods pointer is at the right offset for today's
layout of MemoryContextData, though obviously going forward it'd be
better if there wasn't extra overhead needed to make this happen.
regards, tom lane
On 5/20/22 1:56 AM, David Rowley wrote:
Hackers,
Over the past few days I've been gathering some benchmark results
together to show the sort performance improvements in PG15 [1].
So it basically looks like I discovered a very bad case that causes a
significant slowdown. Yet other cases that are not an exact power of
2 stand to gain significantly from this change.
I'm unsure exactly what I should do about this right now.
While this is being actively investigated, I added this into open items.
Thanks,
Jonathan
On Tue, 24 May 2022 at 09:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
Hand-wavy idea: It's probably too complex for now, and it also might
be too much overhead, but having GenerationPointerGetChunk() do a
binary search on a sorted-by-memory-address array of block pointers
might be a fast enough way to find the block that the pointer belongs
to. There should be far fewer blocks now since generation.c grows
the block sizes, so the O(log2 N) search complexity should never be
excessively high.
However, it would mean a binary search for every pfree, which feels
pretty horrible. My feeling is that it seems unlikely that saving 8
bytes by not storing the GenerationBlock would be a net win here.
I think probably that could be made to work, since a Generation
context should not contain all that many live blocks at any one time.
I've done a rough cut implementation of this and attached it here.
I've not done that much testing yet, but it does seem to fix the
performance regression that I mentioned in the blog post that I linked
in the initial post on this thread.
There are a couple of things to note about the patch:
1. I quickly realised that there's no good place to store the
sorted-by-memory-address array of GenerationBlocks. In the patch, I've
had to malloc() this array and also had to use a special case so that
I didn't try to do another malloc() inside GenerationContextCreate().
It's possible that the malloc() / repalloc of this array fails. In
which case, I think I've coded things in such a way that there will be
no memory leaks of the newly added block.
2. I did see GenerationFree() pop up in perf top a little more than it
used to. I considered that we might want to have
GenerationGetBlockFromChunk() cache the last found block for the set
and then check that one first. We expect generation contexts to
pfree() in an order that would likely make us hit this case most of
the time. I added a few lines to the attached v2 patch to add a
last_pfree_block field to the context struct. However, this seems to
hinder performance more than it helps. It can easily be removed from
the v2 patch.
In the attached results you can see the PG14 and PG15 numbers are the
same as I reported on the blog post I linked earlier. It seems that for
PG15_patched virtually all work_mem sizes produce results that are
faster than PG14. The exception here is the 16GB test where
PG15_patched is 0.8% slower, which seems within the noise threshold.
David
David Rowley <dgrowleyml@gmail.com> writes:
On Tue, 24 May 2022 at 09:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think probably that could be made to work, since a Generation
context should not contain all that many live blocks at any one time.
I've done a rough cut implementation of this and attached it here.
I've not done that much testing yet, but it does seem to fix the
performance regression that I mentioned in the blog post that I linked
in the initial post on this thread.
Here's a draft patch for the other way of doing it. I'd first tried
to make the side-effects completely local to generation.c, but that
fails in view of code like
MemoryContextAlloc(GetMemoryChunkContext(x), ...)
Thus we pretty much have to have some explicit awareness of this scheme
in GetMemoryChunkContext(). There's more than one way it could be
done, but I thought a clean way is to invent a separate NodeTag type
to identify the indirection case.
So this imposes a distributed overhead of one additional test-and-branch
per pfree or repalloc. I'm inclined to think that that's negligible,
but I've not done performance testing to try to prove it.
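A toy model of the scheme (names hypothetical, not the actual patch): the word stored before a chunk is either a context or an indirection node carrying a distinct tag, and the lookup pays exactly one extra test-and-branch.

```c
#include <assert.h>

/* Sketch: the pointer stored before a chunk is either a context or an
 * indirection node with its own NodeTag; resolving it costs one extra
 * test-and-branch.  All names are hypothetical. */
typedef enum DemoNodeTag
{
    T_DemoAllocSetContext,
    T_DemoGenerationBlockLink   /* the new, separate NodeTag */
} DemoNodeTag;

typedef struct DemoContext
{
    DemoNodeTag tag;
    int         id;
} DemoContext;

typedef struct DemoBlockLink
{
    DemoNodeTag  tag;           /* always T_DemoGenerationBlockLink */
    DemoContext *context;       /* back-link to the owning context */
} DemoBlockLink;

static DemoContext *
demo_get_chunk_context(void *chunk_word)
{
    /* chunk_word is whatever was stored just before the chunk */
    if (*(DemoNodeTag *) chunk_word == T_DemoGenerationBlockLink)
        return ((DemoBlockLink *) chunk_word)->context; /* extra branch */
    return (DemoContext *) chunk_word;
}
```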
For back-patching into v14, we could put the new NodeTag type at the
end of that enum list. The change in the inline GetMemoryChunkContext
is probably an acceptable hazard.
regards, tom lane