HashAgg degenerate case

Started by Jeff Davis over 1 year ago · 9 messages · bugs
#1 Jeff Davis
pgsql@j-davis.com

HashAgg tracks the memory of the hash table bucket array (only the
TupleHashEntryData entries) differently from the memory of the overall
hash table (including the firstTuple for each entry).

It can run into a degenerate case when the bucket array grows larger
than the allotted memory for the overall hash table. That, by itself,
shouldn't be a major problem because it would just spill any new groups
and then start over in the next batch.

However, ResetTupleHashTable() calls tuplehash_reset(), and that
preserves the bucket array (just clearing it). That means that the next
time around it's already over the limit, and degenerates to holding a
single tuple.

Unfortunately I don't have a small reproducible case yet. I believe it
would need to be a case where the bucket array is more than half of the
memory limit, and then the hash table needs to grow again (doubling).
The actual case involves parallelism, so maybe that throws off the
estimates in an interesting way.

Fixing it seems fairly easy though: we just need to completely destroy
the hash table each time and recreate it. Something close to the
attached patch (rough).

We might also want to make the degenerate case less painful -- should
we raise the minimum number of groups per batch? One possible problem
would be around groups where the state is large, like ARRAY_AGG(), but
we could have different limits.

Regards,
Jeff Davis

Attachments:

v1-0001-HashAgg-completely-rebuild-hash-tables-each-itera.patch (text/x-patch; +2/-9)
#2 David Rowley
dgrowleyml@gmail.com
In reply to: Jeff Davis (#1)
Re: HashAgg degenerate case

On Wed, 6 Nov 2024 at 14:00, Jeff Davis <pgsql@j-davis.com> wrote:

Fixing it seems fairly easy though: we just need to completely destroy
the hash table each time and recreate it. Something close to the
attached patch (rough).

I don't think it could be that exactly though as that could lead to
JIT compilation over and over again from the following chain of
function calls: build_hash_tables() -> build_hash_table() ->
BuildTupleHashTableExt() -> ExecBuildGroupingEqual() ->
ExecReadyExpr() -> jit_compile_expr()

David

#3 Jeff Davis
pgsql@j-davis.com
In reply to: David Rowley (#2)
Re: HashAgg degenerate case

On Wed, 2024-11-06 at 15:50 +1300, David Rowley wrote:

I don't think it could be that exactly though as that could lead to
JIT compilation over and over again from the following chain of
function calls: build_hash_tables() -> build_hash_table() ->
BuildTupleHashTableExt() -> ExecBuildGroupingEqual() ->
ExecReadyExpr() -> jit_compile_expr()

New patch attached. It extends ResetTupleHashTable() so that the caller
can force setting a new nbuckets, which recreates the tuplehash_hash.

agg_refill_hash_table() now uses that to recalculate and set the
nbuckets for the current grouping set's hashtab, and frees the hashtab
for all other grouping sets (it "frees" them by setting to a minimal
size bucket array).

Regards,
Jeff Davis

Attachments:

v2-0001-HashAgg-free-hash-table-memory-each-iteration.patch (text/x-patch; +51/-12)
#4 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#1)
Re: HashAgg degenerate case

Hi,

On 2024-11-05 16:59:56 -0800, Jeff Davis wrote:

Fixing it seems fairly easy though: we just need to completely destroy
the hash table each time and recreate it. Something close to the
attached patch (rough).

That'll often be *way* slower though. Both because acquiring and faulting-in
memory is far from free and because it'd often lead to starting to grow the
hashtable from a small size again.

I think this patch would lead to way bigger regressions than the occasionally
too large hashtable does. I'm not saying that we shouldn't do something about
that, but I don't think it can be this.

Greetings,

Andres Freund

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#4)
Re: HashAgg degenerate case

On Fri, 2024-11-08 at 11:41 -0500, Andres Freund wrote:

That'll often be *way* slower though. Both because acquiring and
faulting-in memory is far from free and because it'd often lead to
starting to grow the hashtable from a small size again.

I assume this statement also applies to my more recent message?

/messages/by-id/4e9bfa724b301b55a2a242ebebcddea62b2e1292.camel@j-davis.com

I think this patch would lead to way bigger regressions than the
occasionally too large hashtable does. I'm not saying that we shouldn't
do something about that, but I don't think it can be this.

The case I observed had a bucket array of ~8M, which took about 200MB,
while the hash_mem_limit was only 128MB. I'm not quite sure how it got
into that state (probably related to the doubling behavior of the
bucket array), but once it did, it was incredibly slow because the
first tuple always triggered spilling to disk.

I can think of two approaches to solve it:

1. Detect the case when there is an obvious imbalance, like the metacxt
using 90+% of the memory limit before a batch even begins, and then
destroy/recreate the hash table.

2. Change the rules in hash_agg_check_limits() so that reasonable
progress can be made regardless of the size of metacxt. For instance,
don't count meta_mem as taking more than 75% of the limit. Or always
permit a new group if there are fewer than 25% of the
hash_ngroups_limit.

Thoughts?

Regards,
Jeff Davis

#6 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#5)
Re: HashAgg degenerate case

On Fri, 2024-11-08 at 10:48 -0800, Jeff Davis wrote:

I can think of two approaches to solve it:

Another thought: the point of spilling is to avoid using additional
memory by adding a group.

If the bucket array is 89.8% full, adding a new group doesn't require
new buckets to be allocated, so we only have the firstTuple and
pergroup data to worry about. If that causes the memory limit to be
exceeded, it's the perfect time to switch to spill mode.

But if the bucket array is 89.9% full, then adding a new group will
cause the bucket array to double. If that causes the memory limit to be
exceeded, then we can switch to spill mode, but it's wasteful to do so
because (a) we won't be using most of those new buckets; (b) the new
buckets will crowd out space for subsequent batches and even fewer of
the buckets will be used; and (c) the memory accounting can be off by
quite a bit.

What if we have a check where, if the metacxt is using more than 40% of
the memory, and if adding a new group would reach the grow_threshold,
then enter spill mode immediately? To make this work, I think we either
need to use a tuplehash_lookup() followed by a tuplehash_insert() (two
lookups for each new group), or we would need a new API into simplehash
like tuplehash_insert_without_growing() that would return NULL instead
of growing. This approach might not be backportable, but it might be a
good approach for 18+.

Regards,
Jeff Davis

#7 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#6)
Re: HashAgg degenerate case

Hi,

On 2024-11-08 12:40:39 -0800, Jeff Davis wrote:

On Fri, 2024-11-08 at 10:48 -0800, Jeff Davis wrote:

I can think of two approaches to solve it:

Another thought: the point of spilling is to avoid using additional
memory by adding a group.

If the bucket array is 89.8% full, adding a new group doesn't require
new buckets to be allocated, so we only have the firstTuple and
pergroup data to worry about. If that causes the memory limit to be
exceeded, it's the perfect time to switch to spill mode.

But if the bucket array is 89.9% full, then adding a new group will
cause the bucket array to double. If that causes the memory limit to be
exceeded, then we can switch to spill mode, but it's wasteful to do so
because (a) we won't be using most of those new buckets; (b) the new
buckets will crowd out space for subsequent batches and even fewer of
the buckets will be used; and (c) the memory accounting can be off by
quite a bit.

What if we have a check where, if the metacxt is using more than 40% of
the memory, and if adding a new group would reach the grow_threshold,
then enter spill mode immediately?

That assumes the bucket array is a significant portion of the memory usage -
but that's not a given, right? We also might need to grow to keep the number
of conflict chains at a manageable level.

To make this work, I think we either need to use a tuplehash_lookup()
followed by a tuplehash_insert() (two lookups for each new group), or we
would need a new API into simplehash like tuplehash_insert_without_growing()
that would return NULL instead of growing.

The former seems entirely non-viable on performance grounds. The latter
might be ok, but I'm wary on code complexity grounds.

What about simply checking whether the bucket array was a significant portion
of the memory usage at spill time and recreating the hashtable instead of
resetting IFF it was?

This approach might not be backportable, but it might be a good approach for
18+.

I doubt any change around this ought to be backpatched. The likelihood
of regressions seems too high.

Greetings,

Andres Freund

#8 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#7)
Re: HashAgg degenerate case

On Tue, 2024-11-12 at 10:48 -0500, Andres Freund wrote:

That assumes the bucket array is a significant portion of the memory
usage - but that's not a given, right?

TupleHashEntryData is 24 bytes, and the smallest firstTuple (an
integer key and that's it) is 20 bytes plus palloc overhead. So even if
the hash table is perfectly full, it can still be ~50% of total memory
usage. The fraction can of course be smaller with compound grouping
keys and multiple aggregates, but it is still likely to be a
significant fraction.

For example:

create table big(i int, j int);
insert into big select g, 1 from generate_series(1,10000000) g;
vacuum freeze analyze big; checkpoint;
set hash_mem_multiplier = 1;
set work_mem = '430MB';
explain analyze select count(*) from
(select i from big group by i) s;

At peak, it uses ~630MB total (exceeding the memory limit by 40%), of
which ~400MB is in the metacxt. The hashtab has a size of 16777216 and
5896580 members.

Due to the cases that cause grow_threshold to be set to zero, the
bucket array is often about 3X the number of members just after
growing. And if the bucket array consumes most or all of the available
memory, that crowds out memory for the next batch, which may
recursively spill with a very sparse bucket array. That makes it all
the more important that we get that memory usage of the bucket array
under control.

A silver lining is that smaller hash tables often perform better, so
restricting batch size doesn't necessarily cause a regression and may
even improve things. I've known this since I originally wrote the disk-
based HashAgg, and perhaps we should explore this more.

While I'm looking, I also see a couple other areas of potential
improvement:

* can we reduce the size of TupleHashEntryData by having only one
pointer?
* it looks like we should be even more aggressive about partitioning
widely. Sometimes a larger work mem creates a smaller number of
partitions, and it can end up recursing more and producing more total
batches

The former seems entirely non-viable on performance grounds.

Yeah, I think that was discussed during the original feature
development.

The latter might be ok, but I'm wary on code complexity grounds.

I attached a proof of concept patch, and it does add some complexity,
but it's not terrible. If tuplehash is going to be used for large hash
tables, I don't think we can hide the doubling nature from the caller
forever, assuming that the caller can do something about it (like
spill).

What about simply checking whether the bucket array was a significant
portion of the memory usage at spill time and recreating the hashtable
instead of resetting IFF it was?

I think that would basically solve the degenerate case, but:

(a) if "significant" means ~40%, then it will trigger recreation for
a lot of cases, and it re-raises the questions about performance
regressions;
(b) it doesn't correct the problem where we grow the hashtable just
to hold one additional element, causing it to exceed the available
memory and not be able to actually use those new buckets; and
(c) it doesn't prevent overshooting the memory limit by a significant
amount.

A slightly better approach might be to recreate the hashtable if the
last batch had a hashtable that is less than 10% filled or something,
which will hopefully avoid problem (a).

I doubt any change around this ought to be backpatched. The likelihood
of regressions seems too high.

Perhaps there's some low-risk change we can make, like having a higher
minimum number of groups before spilling, or something like that?

I can move this discussion to -hackers until we sort it out.

Regards,
Jeff Davis

Attachments:

v1-0001-simplehash-add-grow_ok-to-API.patch (text/x-patch; +30/-6)
v1-0002-HashAgg-avoid-unexpected-hash-table-growth-near-m.patch (text/x-patch; +111/-26)
#9 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#8)
Re: HashAgg degenerate case

On Tue, 2024-11-12 at 15:15 -0800, Jeff Davis wrote:

I attached a proof of concept patch, and it does add some complexity,
but it's not terrible. If tuplehash is going to be used for large hash
tables, I don't think we can hide the doubling nature from the caller
forever, assuming that the caller can do something about it (like
spill).

Previous version was incorrect, v2 attached, which is still rough, but
hopefully better.

Regards,
Jeff Davis

Attachments:

v2-0001-simplehash-add-grow_ok-to-API.patch (text/x-patch; +34/-7)
v2-0002-HashAgg-avoid-unexpected-hash-table-growth-near-m.patch (text/x-patch; +111/-26)