Incorrect CHUNKHDRSZ in nodeAgg.c
While reading nodeAgg.c, I noticed code that uses CHUNKHDRSZ to help
figure out how much memory a tuple uses so the code knows when to
spill to disk. CHUNKHDRSZ is currently set to 16 bytes, which was
fine when that code was added, but it's a few years out-of-date since
c6e0fe1f2 in 2022.
The attached adjusts the 16 to sizeof(MemoryChunk), which is normally
8, but 16 in assert builds.
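In effect, the change amounts to something like this (a sketch, not the
literal diff; the real thing is in the attached patch, and the extra
#include may already be satisfied indirectly):

    #include "utils/memutils_memorychunk.h"    /* for MemoryChunk */

    /* was: #define CHUNKHDRSZ 16 */
    #define CHUNKHDRSZ sizeof(MemoryChunk)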
The memory accounting should be more accurate with the patch; however,
that accounting still doesn't include the blocks that the chunks are
stored on. For that reason, I'm thinking of not backpatching this, as it
will make hash aggregates use a bit more memory before spilling than
unpatched builds do, albeit, for most cases, closer to the hash_mem
limit, which is when the spilling should be happening.
Sound ok?
David
David Rowley <dgrowleyml@gmail.com> writes:
While reading nodeAgg.c, I noticed code that uses CHUNKHDRSZ to help
figure out how much memory a tuple uses so the code knows when to
spill to disk. CHUNKHDRSZ is currently set to 16 bytes, which was
fine when that code was added, but it's a few years out-of-date since
c6e0fe1f2 in 2022.
The attached adjusts the 16 to sizeof(MemoryChunk), which is normally
8, but 16 in assert builds.
Yeah, this is more formally correct ...
The memory accounting should be more accurate with the patch; however,
that accounting still doesn't include the blocks that the chunks are
stored on. For that reason, I'm thinking of not backpatching this, as it
will make hash aggregates use a bit more memory before spilling than
unpatched builds do, albeit, for most cases, closer to the hash_mem
limit, which is when the spilling should be happening.
Agreed that back-patching isn't appropriate.
I thought for a bit about whether we shouldn't try to account for
palloc power-of-2-block-size overhead here. That omission would
typically be a far larger error than the one you are fixing. However,
given that the inputs to hash_agg_entry_size are only estimates,
I'm not sure that we can hope to do better than the current behavior.
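(To put a number on it: a 24-byte palloc in an aset context gets rounded
up to a 32-byte chunk, plus the 8-byte chunk header, so 40 bytes are
consumed for 24 requested; the estimate counts the header but not the
round-up.)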
Should tuple hash tables be using a different memory context type
that doesn't impose that power-of-2 overhead? It's only useful
when we expect a fair amount of pfree-and-recycle behavior, but
I think we don't have much retail entry removal in tuple hash
tables. Could we use a generation or even bump context?
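Just to illustrate what I mean (the context name and placement here are
made up):

    /* hypothetical: back the hash table's tuple storage with a bump context */
    tablecxt = BumpContextCreate(CurrentMemoryContext,
                                 "TupleHashTable tuples",
                                 ALLOCSET_DEFAULT_SIZES);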
regards, tom lane
On Thu, 2 Jan 2025 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I thought for a bit about whether we shouldn't try to account for
palloc power-of-2-block-size overhead here. That omission would
typically be a far larger error than the one you are fixing. However,
given that the inputs to hash_agg_entry_size are only estimates,
I'm not sure that we can hope to do better than the current behavior.
Likely the most correct way would be to use GetMemoryChunkSpace(), but
there might be some additional overhead to consider there. I looked at
[1], where using GetMemoryChunkSpace() was discussed for this
purpose. There likely is a small overhead to doing so, which is
something to consider.
Should tuple hash tables be using a different memory context type
that doesn't impose that power-of-2 overhead? It's only useful
when we expect a fair amount of pfree-and-recycle behavior, but
I think we don't have much retail entry removal in tuple hash
tables. Could we use a generation or even bump context?
Bump wouldn't work due to the SH_FREE() in SH_GROW() when resizing the
table. If sizeof(TupleHashEntryData) were a power-of-two, then there'd
be no wastage as the hash table always has a power-of-two bucket count
and two powers-of-two multiplied are always a power-of-two value.
Unfortunately, TupleHashEntryData is 24 bytes and I don't see any easy
way to shrink it to 16 bytes.
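To put rough numbers on it, assuming the bucket array allocation does
get rounded up to a power of two:

    2^20 buckets * 24 bytes = 24MB, rounds up to 32MB (8MB wasted)
    2^20 buckets * 16 bytes = 16MB, already a power of two, no waste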
I think what would be more interesting is seeing if we can store the
TupleHashEntryData.firstTuple in a bump context. I'd need to check if
we ever pfree() those minimal tuples or if we just throw the entire
batch of tuples away with a context reset. Since these minimal tuples
only contain the GROUP BY columns, for most cases they should be very
narrow tuples indeed, so not having a MemoryChunk header would save
quite a bit of memory in many cases. I think that's basically
1083f94da plus 6ed83d5fa for nodeAgg.c.
David
[1]: /messages/by-id/20200325220936.il3ni2fj2j2b45y5@alap3.anarazel.de
David Rowley <dgrowleyml@gmail.com> writes:
On Thu, 2 Jan 2025 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I thought for a bit about whether we shouldn't try to account for
palloc power-of-2-block-size overhead here. That omission would
typically be a far larger error than the one you are fixing. However,
given that the inputs to hash_agg_entry_size are only estimates,
I'm not sure that we can hope to do better than the current behavior.
Likely the most correct way would be to use GetMemoryChunkSpace(), but
there might be some additional overhead to consider there.
Nah, you've got the wrong mental model. hash_agg_entry_size is
trying to predict the average hash entry size in advance of seeing
any actual data, so that we can estimate how many entries will fit
in work_mem. By the time we can use GetMemoryChunkSpace on an
actual entry, it's too late for that.
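From memory, the estimate is built up roughly like this (simplified
shape of hash_agg_entry_size()):

    tupleSize    = MAXALIGN(SizeofMinimalTupleHeader) + tupleWidth;
    pergroupSize = numTrans * sizeof(AggStatePerGroupData);

    return sizeof(TupleHashEntryData) +
        (CHUNKHDRSZ + tupleSize) +
        (pergroupSize > 0 ? CHUNKHDRSZ + pergroupSize : 0) +
        (transitionSpace > 0 ? CHUNKHDRSZ + transitionSpace : 0);

All of those inputs are planner estimates, not measurements.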
Could we use a generation or even bump context?
Bump wouldn't work due to the SH_FREE() in SH_GROW() when resizing the
table.
Meh. I guess we'd have to keep that structure in a context separate
from the tuples. Might not be worth the trouble.
I think what would be more interesting is seeing if we can store the
TupleHashEntryData.firstTuple in a bump context.
Are you saying the same as above, or something different?
regards, tom lane
On Thu, 2 Jan 2025 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
I think what would be more interesting is seeing if we can store the
TupleHashEntryData.firstTuple in a bump context.
Are you saying the same as above, or something different?
I thought you only meant to store the hash buckets in a bump context.
I didn't realise you meant the tuples too. Seems we were talking about
the same thing.
David
On Thu, Jan 2, 2025 at 7:24 AM David Rowley <dgrowleyml@gmail.com> wrote:
Bump wouldn't work due to the SH_FREE() in SH_GROW() when resizing the
table. If sizeof(TupleHashEntryData) were a power-of-two, then there'd
be no wastage as the hash table always has a power-of-two bucket count
and two powers-of-two multiplied are always a power-of-two value.
Unfortunately, TupleHashEntryData is 24 bytes and I don't see any easy
way to shrink it to 16 bytes.
FYI, there is a proposal for that at
/messages/by-id/817d244237878cebdff0bc363718feaf49a1ea7d.camel@j-davis.com
--
John Naylor
Amazon Web Services
On Thu, 2025-01-02 at 13:38 +1300, David Rowley wrote:
On Thu, 2 Jan 2025 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
I think what would be more interesting is seeing if we can store the
TupleHashEntryData.firstTuple in a bump context.
Are you saying the same as above, or something different?
I thought you only meant to store the hash buckets in a bump context.
I didn't realise you meant the tuples too. Seems we were talking about
the same thing.
There are several things to keep track of:
1. The bucket array, which is already in its own context (called the
metacxt). AllocSet is probably fine here. I agree that there is some
room for improvement, but remember that if the bucket array gets to any
interesting size, it will be allocated in its own block.
2. The grouping keys (called firstTuple), which are in the tablecxt.
These could benefit from the Bump allocator.
3. The pergroup states, which are also stored in the tablecxt, and
these can also benefit from the Bump allocator.
4. If the transition type is by-ref, the transition value.
If we separate out 4, we can use the Bump allocator for 2 & 3.
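In sketch form (the context names here are just placeholders):

    /* 2 & 3 (grouping keys and pergroup states) can then live in a bump
     * context, while 4 (by-ref transition values) gets its own aset
     * context so it can still be pfree'd and repalloc'd */
    tablecxt = BumpContextCreate(CurrentMemoryContext,
                                 "HashAgg table entries",
                                 ALLOCSET_DEFAULT_SIZES);
    byrefcxt = AllocSetContextCreate(CurrentMemoryContext,
                                     "HashAgg by-ref transition values",
                                     ALLOCSET_DEFAULT_SIZES);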
Regards,
Jeff Davis
On Sat, 2025-01-04 at 09:24 +0700, John Naylor wrote:
FYI, there is a proposal for that at
/messages/by-id/817d244237878cebdff0bc363718feaf49a1ea7d.camel@j-davis.com
I had intended to commit some of those patches soon, so if someone sees
a conflict with the ideas in this thread, please let me know.
Regards,
Jeff Davis
On Mon, 2025-01-06 at 12:34 -0800, Jeff Davis wrote:
If we separate out 4, we can use the Bump allocator for 2 & 3.
Attached POC patch, which reduces memory usage by ~15% for a simple
distinct query on an integer key. Performance is the same or perhaps a
hair faster.
It's not many lines of code, but the surrounding code might benefit
from some refactoring which would make it a bit simpler.
Regards,
Jeff Davis
On Thu, 9 Jan 2025 at 09:50, Jeff Davis <pgsql@j-davis.com> wrote:
Attached POC patch, which reduces memory usage by ~15% for a simple
distinct query on an integer key. Performance is the same or perhaps a
hair faster.
It's not many lines of code, but the surrounding code might benefit
from some refactoring which would make it a bit simpler.
Thanks for working on this. Here's a preliminary review:
Since bump.c does not add headers to the palloc'd chunks, I think the
following code from hash_agg_entry_size() shouldn't be using
CHUNKHDRSZ anymore.
    tupleChunkSize = CHUNKHDRSZ + tupleSize;
    if (pergroupSize > 0)
        pergroupChunkSize = CHUNKHDRSZ + pergroupSize;
    else
        pergroupChunkSize = 0;
You should be able to get rid of pergroupChunkSize and just use
pergroupSize in the return.
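i.e. roughly this shape for the return value (an untested sketch of
what I mean):

    /* bump.c chunks have no per-chunk header, so skip CHUNKHDRSZ for the
     * tuple and the pergroup data; only the by-ref transition value is
     * still a headered palloc chunk */
    return sizeof(TupleHashEntryData) +
        tupleSize +
        pergroupSize +
        transitionChunkSize;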
I did some benchmarking using the attached script. There's a general
speedup, but I saw some unexpected increase in the number of batches
with the patched version on certain tests. See the attached results.
For example, the work_mem = 8MB with 10 million rows shows "Batches:
129" on master but "Batches: 641" with the patched version. I didn't
check why.
David
On Fri, 2025-01-10 at 23:30 +1300, David Rowley wrote:
Since bump.c does not add headers to the palloc'd chunks, I think the
following code from hash_agg_entry_size() shouldn't be using
CHUNKHDRSZ anymore.
Fixed.
I also tried to account for the power-of-two allocations for the
transition values. We don't do that in other places, but now that we
have the bump allocator which does not do that, it seems reasonable to
account for it here.
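Concretely, something like this in hash_agg_entry_size() (a sketch; I'm
assuming a round-up helper along the lines of pg_nextpower2_size_t()
here, the real change is in the attached v2):

    if (transitionSpace > 0)
        transitionChunkSize = CHUNKHDRSZ + pg_nextpower2_size_t(transitionSpace);
    else
        transitionChunkSize = 0;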
I did some benchmarking using the attached script. There's a general
speedup, but I saw some unexpected increase in the number of batches
with the patched version on certain tests. See the attached results.
For example, the work_mem = 8MB with 10 million rows shows "Batches:
129" on master but "Batches: 641" with the patched version. I didn't
check why.
Somewhat counter-intuitively, HashAgg can use more batches when there
is more memory available. If already spilling, creating more small
batches is good, because it reduces the chances of recursing. The
limiting factor for creating a lot of tiny batches is that each
partition requires a logtape with its own write buffer, so if there's
more memory available, that allows creating more logtapes and a higher
partitioning fanout.
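As a rough illustration, assuming each spill partition needs about an
8kB write buffer:

    8MB work_mem / 8kB per-partition buffer = ~1024 buffers' worth of memory

So the less memory the tuples themselves take at spill time, the more
of the budget is left for partition buffers and the higher the chosen
fanout; hence 641 batches rather than 129.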
The runtimes I got running your tests are mixed. I'm still analyzing
whether test noise is a factor, or whether the increased number of
partitions is a factor. But any runtime regressions are minor in
comparison to the memory savings, so I think we are on the right track.
Attached v2.
Regards,
Jeff Davis
On Tue, 14 Jan 2025 at 09:09, Jeff Davis <pgsql@j-davis.com> wrote:
Attached v2.
This needs to be rebased due to b4a07f532.
The following repalloc was removed by that commit.
- totalsize = MAXALIGN(firstTuple->t_len) + hashtable->additionalsize;
- firstTuple = repalloc(firstTuple, totalsize);
+ totalsize = MAXALIGN(mtup->t_len) + hashtable->additionalsize;
+ firstTuple = MemoryContextAlloc(hashtable->tablecxt, totalsize);
+ memcpy(firstTuple, mtup, mtup->t_len);
David
On Wed, 2025-01-15 at 21:01 +1300, David Rowley wrote:
On Tue, 14 Jan 2025 at 09:09, Jeff Davis <pgsql@j-davis.com> wrote:
Attached v2.
This needs to be rebased due to b4a07f532.
I moved the patch to the other thread here:
/messages/by-id/09b325921e50bc3a3217fb01d8eb512c89ee36f1.camel@j-davis.com
because they are all related to reducing the memory usage of HashAgg.
Regards,
Jeff Davis