Hash aggregate collisions cause excessive spilling

Started by Ants Aasma, about 2 months ago, 13 messages, hackers
#1 Ants Aasma
ants.aasma@cybertec.at

Investigating a customer complaint I ran into an issue with the hash
aggregate code. The problem was that a query that usually completes in
less than a minute sometimes gets stuck indefinitely (hours+). I
tracked it down to a hash aggregate node returning one tuple from a
batch and spilling the rest.

The reason for the behavior is that aggstate->hash_metacxt was 100M,
which is larger than work_mem*hash_mem_multiplier of 64M. This makes
hash_agg_check_limits() always spill after the first tuple. I think
that ends up having an n² overhead, with n being almost 4M here.

I don't have a simple reproducer yet, because the live problem was on
a parallel query where looking at the backend wrong caused the problem
to disappear. After some retries I was able to catch an instance of
growing past work_mem with gdb. After that growth the simplehash was
{size = 4194304, members = 409839, ..}, i.e. the table was only 20%
full before growing. So the cause seems to be a run of hash collisions
bigger than SH_GROW_MAX_MOVE (150).

AFAICT there is nothing in simplehash that would stop it growing past
work_mem, and once it does the spilling logic in
agg_refill_hash_table() enters this degenerate state until the end of
the plan node.

I think the correct fix would be to have a way to insert into
simplehash with a limit on size, which means that the insert might
fail. I haven't yet looked at how complicated this would be to
implement.

I also haven't checked what is the cause for such a long run of
collisions. But I think it's related to it being a HashAggregate on
top of Gather on top of HashAggregate.

Regards,
Ants Aasma

#2 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Ants Aasma (#1)
Re: Hash aggregate collisions cause excessive spilling

On 2/19/26 15:49, Ants Aasma wrote:

Investigating a customer complaint I ran into an issue with the hash
aggregate code. The problem was that a query that usually completes in
less than a minute sometimes gets stuck indefinitely (hours+). I
tracked it down to a hash aggregate node returning one tuple from a
batch and spilling the rest.

The reason for the behavior is that aggstate->hash_metacxt was 100M,
which is larger than work_mem*hash_mem_multiplier of 64M. This makes
hash_agg_check_limits() always spill after the first tuple. I think
that ends up having an n² overhead, with n being almost 4M here.

Not great :-(

I don't have a simple reproducer yet, because the live problem was on
a parallel query where looking at the backend wrong caused the problem
to disappear. After some retries I was able to catch an instance of
growing past work_mem with gdb. After that growth the simplehash was
{size = 4194304, members = 409839, ..}, i.e. the table was only 20%
full before growing. So the cause seems to be a run of hash collisions
bigger than SH_GROW_MAX_MOVE (150).

AFAICT there is nothing in simplehash that would stop it growing past
work_mem, and once it does the spilling logic in
agg_refill_hash_table() enters this degenerate state until the end of
the plan node.

I think the correct fix would be to have a way to insert into
simplehash with a limit on size, which means that the insert might
fail. I haven't yet looked at how complicated this would be to
implement.

Wouldn't it be easier to just start ignoring SH_GROW_MAX_MOVE? That'd
have a little bit of performance impact on that one key, but that seems
acceptable. And easier to do than dealing with failing inserts.

I also haven't checked what is the cause for such a long run of
collisions. But I think it's related to it being a HashAggregate on
top of Gather on top of HashAggregate.

So it's a parallel aggregate? Partial + Finalize? I wonder if that might
be "correlating" the data in a way that makes it more likely to hit
SH_GROW_MAX_MOVE. But if that were the case, wouldn't we see this issue
more often?

regards

--
Tomas Vondra

#3 Ants Aasma
ants.aasma@cybertec.at
In reply to: Tomas Vondra (#2)
Re: Hash aggregate collisions cause excessive spilling

On Thu, 19 Feb 2026 at 17:33, Tomas Vondra <tomas@vondra.me> wrote:

I think the correct fix would be to have a way to insert into
simplehash with a limit on size, which means that the insert might
fail. I haven't yet looked at how complicated this would be to
implement.

Wouldn't it be easier to just start ignoring SH_GROW_MAX_MOVE? That'd
have a little bit of performance impact on that one key, but that seems
acceptable. And easier to do than dealing with failing inserts.

I am not sure if that is a 100% fix, but it would definitely be a
whole lot better than the current behavior. I'm not sure if it's
guaranteed that the two other memory contexts end up being more than
the size of the hash table. If they aren't then the hashtable could
still grow too big once a hashtable bigger than half the allowed size
hits grow_threshold.

I am more worried about having a huge run of tuples in the hash table,
which will happen if inserts are correlated by hash id. That will also
cause quadratic slowdown.

I also haven't checked what is the cause for such a long run of
collisions. But I think it's related to it being a HashAggregate on
top of Gather on top of HashAggregate.

So it's a parallel aggregate? Partial + Finalize? I wonder if that might
be "correlating" the data in a way that makes it more likely to hit
SH_GROW_MAX_MOVE. But if that were the case, wouldn't we see this issue
more often?

Interestingly the plan doesn't have partial and final on those hash agg nodes:

->  HashAggregate  (cost=142400.87..142800.87 rows=40000 width=16) (actual time=7978.262..9591.682 rows=3698243 loops=1)
      Group Key: "*SELECT* 2_4".vehicle_id, "*SELECT* 2_4".day
      Batches: 21  Memory Usage: 65593kB  Disk Usage: 118256kB
      ->  Gather  (cost=133600.87..142000.87 rows=80000 width=16) (actual time=1898.473..4772.296 rows=3698243 loops=1)
            Workers Planned: 2
            Workers Launched: 2
            ->  HashAggregate  (cost=132600.87..133000.87 rows=40000 width=16) (actual time=1586.697..2040.368 rows=1232748 loops=3)
                  Group Key: "*SELECT* 2_4".vehicle_id, "*SELECT* 2_4".day
                  Batches: 1  Memory Usage: 5137kB
                  Worker 0:  Batches: 5  Memory Usage: 79921kB  Disk Usage: 40024kB
                  Worker 1:  Batches: 5  Memory Usage: 81969kB  Disk Usage: 36112kB

There are timescale tables involved in the plan, so I think timescale
might be behind that.

There is this comment above the simplehash growing logic:

* To avoid negative consequences from overly imbalanced
* hashtables, grow the hashtable if collisions would require
* us to move a lot of entries. The most likely cause of such
* imbalance is filling a (currently) small table, from a
* currently big one, in hash-table order.

The problem disappears if I have a breakpoint on tuplehash_grow, so
apparently triggering the problem requires that the lower hashtable
scans interleave in a particular manner to trigger the excess growth
of the upper node.

I'm wondering if some way to decorrelate the hashtables would help.
For example a hashtable specific (pseudo)random salt.

Regards,
Ants Aasma

#4 Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#3)
Re: Hash aggregate collisions cause excessive spilling

Hi,

On 2026-02-19 18:16:57 +0200, Ants Aasma wrote:

So it's a parallel aggregate? Partial + Finalize? I wonder if that might
be "correlating" the data in a way that makes it more likely to hit
SH_GROW_MAX_MOVE. But if that were the case, wouldn't we see this issue
more often?

Interestingly the plan doesn't have partial and final on those hash agg nodes:

->  HashAggregate  (cost=142400.87..142800.87 rows=40000 width=16) (actual time=7978.262..9591.682 rows=3698243 loops=1)
      Group Key: "*SELECT* 2_4".vehicle_id, "*SELECT* 2_4".day
      Batches: 21  Memory Usage: 65593kB  Disk Usage: 118256kB
      ->  Gather  (cost=133600.87..142000.87 rows=80000 width=16) (actual time=1898.473..4772.296 rows=3698243 loops=1)
            Workers Planned: 2
            Workers Launched: 2
            ->  HashAggregate  (cost=132600.87..133000.87 rows=40000 width=16) (actual time=1586.697..2040.368 rows=1232748 loops=3)
                  Group Key: "*SELECT* 2_4".vehicle_id, "*SELECT* 2_4".day
                  Batches: 1  Memory Usage: 5137kB
                  Worker 0:  Batches: 5  Memory Usage: 79921kB  Disk Usage: 40024kB
                  Worker 1:  Batches: 5  Memory Usage: 81969kB  Disk Usage: 36112kB

There are timescale tables involved in the plan, so I think timescale
might be behind that.

Hm, so timescale creates a plan that we would not?

There is this comment above the simplehash growing logic:

* To avoid negative consequences from overly imbalanced
* hashtables, grow the hashtable if collisions would require
* us to move a lot of entries. The most likely cause of such
* imbalance is filling a (currently) small table, from a
* currently big one, in hash-table order.

The problem disappears if I have a breakpoint on tuplehash_grow, so
apparently triggering the problem requires that the lower hashtable
scans interleave in a particular manner to trigger the excess growth
of the upper node.

I'm wondering if some way to decorrelate the hashtables would help.
For example a hashtable specific (pseudo)random salt.

We do try to add a hash-IV that's different for each worker:

/*
* If parallelism is in use, even if the leader backend is performing the
* scan itself, we don't want to create the hashtable exactly the same way
* in all workers. As hashtables are iterated over in keyspace-order,
* doing so in all processes in the same way is likely to lead to
* "unbalanced" hashtables when the table size initially is
* underestimated.
*/
if (use_variable_hash_iv)
hash_iv = murmurhash32(ParallelWorkerNumber);

I don't remember enough of how the parallel aggregate stuff works. Perhaps the
issue is that the leader is also building a hashtable and it's being inserted
into the post-gather hashtable, using the same IV?

In which case parallel_leader_participation=off should make a difference.

Greetings,

Andres Freund

#5 Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#4)
Re: Hash aggregate collisions cause excessive spilling

On Thu, 19 Feb 2026 at 18:32, Andres Freund <andres@anarazel.de> wrote:

Interestingly the plan doesn't have partial and final on those hash agg nodes:
...
There are timescale tables involved in the plan, so I think timescale
might be behind that.

Hm, so timescale creates a plan that we would not?

No, after poking around in the query, it's actually just that the
aggregate is for implementing DISTINCT, which means there is no
aggregate state.

I'm wondering if some way to decorrelate the hashtables would help.
For example a hashtable specific (pseudo)random salt.

We do try to add a hash-IV that's different for each worker:

/*
* If parallelism is in use, even if the leader backend is performing the
* scan itself, we don't want to create the hashtable exactly the same way
* in all workers. As hashtables are iterated over in keyspace-order,
* doing so in all processes in the same way is likely to lead to
* "unbalanced" hashtables when the table size initially is
* underestimated.
*/
if (use_variable_hash_iv)
hash_iv = murmurhash32(ParallelWorkerNumber);

I don't remember enough of how the parallel aggregate stuff works. Perhaps the
issue is that the leader is also building a hashtable and it's being inserted
into the post-gather hashtable, using the same IV?

In which case parallel_leader_participation=off should make a difference.

After turning leader participation off the problem no longer
reproduced even after 10 iterations, turning it back on it reproduced
on the 4th iteration. Is there any reason why the hash table couldn't
have an unconditional iv that includes the plan node?

Regards,
Ants Aasma

#6 Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#5)
Re: Hash aggregate collisions cause excessive spilling

Hi,

On 2026-02-19 19:06:04 +0200, Ants Aasma wrote:

/*
* If parallelism is in use, even if the leader backend is performing the
* scan itself, we don't want to create the hashtable exactly the same way
* in all workers. As hashtables are iterated over in keyspace-order,
* doing so in all processes in the same way is likely to lead to
* "unbalanced" hashtables when the table size initially is
* underestimated.
*/
if (use_variable_hash_iv)
hash_iv = murmurhash32(ParallelWorkerNumber);

I don't remember enough of how the parallel aggregate stuff works. Perhaps the
issue is that the leader is also building a hashtable and it's being inserted
into the post-gather hashtable, using the same IV?

In which case parallel_leader_participation=off should make a difference.

After turning leader participation off the problem no longer
reproduced even after 10 iterations, turning it back on it reproduced
on the 4th iteration. Is there any reason why the hash table couldn't
have an unconditional iv that includes the plan node?

You mean just use the numerical value of the pointer? I think that'd be pretty
likely to be the same between parallel workers. And I think it's not great for
benchmarking / debugging if every run ends up with a different IV.

But we certainly should do something about the IV for the leader in these
cases.

Greetings,

Andres Freund

#7 Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#6)
Re: Hash aggregate collisions cause excessive spilling

On Thu, 19 Feb 2026 at 19:30, Andres Freund <andres@anarazel.de> wrote:

On 2026-02-19 19:06:04 +0200, Ants Aasma wrote:

After turning leader participation off the problem no longer
reproduced even after 10 iterations, turning it back on it reproduced
on the 4th iteration. Is there any reason why the hash table couldn't
have an unconditional iv that includes the plan node?

You mean just use the numerical value of the pointer? I think that'd be pretty
likely to be the same between parallel workers. And I think it's not great for
benchmarking / debugging if every run ends up with a different IV.

But we certainly should do something about the IV for the leader in these
cases.

I was thinking more along the lines of hashing together the pointer
value and worker number. But something more deterministic would indeed
be better. How about this?

--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -201,3 +201,3 @@ BuildTupleHashTable(PlanState *parent,
        MemoryContext oldcontext;
-       uint32          hash_iv = 0;
+       uint32          hash_iv = parent->plan->plan_node_id;

I also figured out why this is not a more common issue.

The iv randomization is predicated on the aggregate node not having a
final function. Normally the partial hash aggregate will use
randomized iv and the finalize will use 0. But in this case because
it's implementing distinct, there is no finalize function on the upper
hash aggregate so on the leader the upper hash aggregate gets the same
iv as the one below gather.

Are there any other cases where a hash aggregate would be fed from
another hash aggregate using the same group key? Those could run into
the same problem.

Regards,
Ants Aasma

#8 Ants Aasma
ants.aasma@cybertec.at
In reply to: Ants Aasma (#7)
Re: Hash aggregate collisions cause excessive spilling

On Thu, 19 Feb 2026 at 20:07, Ants Aasma <ants.aasma@cybertec.at> wrote:

I was thinking more along the lines of hashing together the pointer
value and worker number. But something more deterministic would indeed
be better. How about this?

--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -201,3 +201,3 @@ BuildTupleHashTable(PlanState *parent,
MemoryContext oldcontext;
-       uint32          hash_iv = 0;
+       uint32          hash_iv = parent->plan->plan_node_id;

I can confirm that this fixes the issue. A standalone reproducer is here:

create table data as select random(1,1000000) from generate_series(1,10000000);
vacuum analyze data;
set enable_gathermerge = off;
explain analyze select distinct random from data;

Regards,
Ants Aasma

#9 Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#8)
Re: Hash aggregate collisions cause excessive spilling

Hi,

On 2026-02-19 20:24:17 +0200, Ants Aasma wrote:

On Thu, 19 Feb 2026 at 20:07, Ants Aasma <ants.aasma@cybertec.at> wrote:

I was thinking more along the lines of hashing together the pointer
value and worker number. But something more deterministic would indeed
be better. How about this?

--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -201,3 +201,3 @@ BuildTupleHashTable(PlanState *parent,
MemoryContext oldcontext;
-       uint32          hash_iv = 0;
+       uint32          hash_iv = parent->plan->plan_node_id;

I can confirm that this fixes the issue. A standalone reproducer is here:

I think this should use something that smears the bits from the plan_id
more widely. The hash functions of some types are ... peculiar (partially due
to cross-type hashjoin support).

I'd also combine it with use_variable_hash_iv, rather than just have
use_variable_hash_iv overwrite it.

create table data as select random(1,1000000) from generate_series(1,10000000);
vacuum analyze data;
set enable_gathermerge = off;
explain analyze select distinct random from data;

Took me a moment to reproduce with that. I also needed parallel_tuple_cost=0
and work_mem=4MB (something fairly small at least) to reproduce it.

With those I can reproduce that the query only terminates in a reasonable time
with the hash_iv = parent->plan->plan_node_id thing.

Greetings,

Andres Freund

#10 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#9)
Re: Hash aggregate collisions cause excessive spilling

Hi,

On 2026-02-19 14:01:15 -0500, Andres Freund wrote:

The hash functions of some types are ... peculiar (partially due to
cross-type hashjoin support).

In case you want a reference for said peculiarity, look at this:

=# SELECT lower, upper, lower | upper as V, hashint8(lower | upper) FROM (SELECT (i) as lower, ((i)::int8 << 32) as upper FROM generate_series(0, 5) g(i));
┌───────┬─────────────┬─────────────┬────────────┐
│ lower │ upper │ v │ hashint8 │
├───────┼─────────────┼─────────────┼────────────┤
│ 0 │ 0 │ 0 │ -272711505 │
│ 1 │ 4294967296 │ 4294967297 │ -272711505 │
│ 2 │ 8589934592 │ 8589934594 │ -272711505 │
│ 3 │ 12884901888 │ 12884901891 │ -272711505 │
│ 4 │ 17179869184 │ 17179869188 │ -272711505 │
│ 5 │ 21474836480 │ 21474836485 │ -272711505 │
└───────┴─────────────┴─────────────┴────────────┘
(6 rows)

Which obviously could have some ... fun consequences.

Whoever chose to implement the cross type compatibility by
a) using int4 as the hash "domain", instead of always hashing 8 bytes
b) implementing int8 by doubling up the high 32bits into the lower 32 bits,
instead of hashint4 iff the value fits into 32 bits and the full 8
bytes otherwise

did us really no service.

Greetings,

Andres Freund

#11 Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#9)
Re: Hash aggregate collisions cause excessive spilling

On Thu, 19 Feb 2026 at 21:01, Andres Freund <andres@anarazel.de> wrote:

On 2026-02-19 20:24:17 +0200, Ants Aasma wrote:

On Thu, 19 Feb 2026 at 20:07, Ants Aasma <ants.aasma@cybertec.at> wrote:

I was thinking more along the lines of hashing together the pointer
value and worker number. But something more deterministic would indeed
be better. How about this?

--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -201,3 +201,3 @@ BuildTupleHashTable(PlanState *parent,
MemoryContext oldcontext;
-       uint32          hash_iv = 0;
+       uint32          hash_iv = parent->plan->plan_node_id;

I can confirm that this fixes the issue. A standalone reproducer is here:

I think this should use something that smears the bits from the plan_id
more widely. The hash functions of some types are ... peculiar (partially due
to cross-type hashjoin support).

I'd also combine it with use_variable_hash_iv, rather than just have
use_variable_hash_iv overwrite it.

This problem should only affect cases where multiple hash tables worth
of tuples are trying to fit into one. I think right now that limits it
to hash aggregates with parallel query. ExecBuildHash32FromAttrs
comment says that we should be using non-zero init_value only when
needed for a marginal performance gain. So I limited the plan node id
inclusion to use_variable_hash_iv, but made it unconditional for
aggregates.

I used hash_combine to smear together the bits of the two. I think
that should be good enough. Alternatively a xor of murmurhash32 of the
two should be better.

There is also a call to ExecBuildHash32FromAttrs from ExecInitSubPlan
that specifies a hash_iv for a hashed sub plan execution. This must
match the value BuildTupleHashTable decides on. Worth adding a
comment?

Regards,
Ants Aasma

Attachments:

v1-0001-Decorrelate-nested-hash-tables.patch (text/x-patch, +2/-3)
#12 Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#11)
Re: Hash aggregate collisions cause excessive spilling

Hi,

On 2026-02-28 17:21:28 +0200, Ants Aasma wrote:

On Thu, 19 Feb 2026 at 21:01, Andres Freund <andres@anarazel.de> wrote:

On 2026-02-19 20:24:17 +0200, Ants Aasma wrote:

On Thu, 19 Feb 2026 at 20:07, Ants Aasma <ants.aasma@cybertec.at> wrote:

I was thinking more along the lines of hashing together the pointer
value and worker number. But something more deterministic would indeed
be better. How about this?

--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -201,3 +201,3 @@ BuildTupleHashTable(PlanState *parent,
MemoryContext oldcontext;
-       uint32          hash_iv = 0;
+       uint32          hash_iv = parent->plan->plan_node_id;

I can confirm that this fixes the issue. A standalone reproducer is here:

I think this should use something that smears the bits from the plan_id
more widely. The hash functions of some types are ... peculiar (partially due
to cross-type hashjoin support).

I'd also combine it with use_variable_hash_iv, rather than just have
use_variable_hash_iv overwrite it.

This problem should only affect cases where multiple hash tables worth
of tuples are trying to fit into one.

Is that entirely true? If we are feeding one hash table from another hash table,
1:1, and the initial size of the target table is wrong, we probably also won't
handle it in the best way.

That could happen with a group by over a group by (with perhaps some joins
above the first group by). Probably not that common, but ...

I guess another case could be a CTAS of a group by query that is then grouped
by again.

I think right now that limits it to hash aggregates with parallel
query. ExecBuildHash32FromAttrs comment says that we should be using
non-zero init_value only when needed for a marginal performance gain. So I
limited the plan node id inclusion to use_variable_hash_iv, but made it
unconditional for aggregates.

That's an argument... I guess we could mostly avoid that though, by just
always combining in EEOP_HASHDATUM_FIRST.

I used hash_combine to smear together the bits of the two. I think
that should be good enough. Alternatively a xor of murmurhash32 of the
two should be better.

Probably it's good enough, but i'd probably still go for
hash_combine(murmurhash32(ParallelWorkerNumber),
murmurhash32(parent->plan->plan_node_id))
or such.

There is also a call to ExecBuildHash32FromAttrs from ExecInitSubPlan
that specifies a hash_iv for a hashed sub plan execution.

When you're saying it specifies a hash IV you mean the 0 it passes, not some
more complicated value, right?

This must match the value BuildTupleHashTable decides on. Worth adding a
comment?

Yes, I think so.

From 624b4827e0d480fc16e016c1ad7c5b26f358b6f3 Mon Sep 17 00:00:00 2001
From: Ants Aasma <ants@cybertec.at>
Date: Sat, 28 Feb 2026 15:08:50 +0200
Subject: [PATCH v1] Decorrelate nested hash tables

Because hash tables are iterated in hash order, using the same hash
function in nested hash tables can lead to excessive collisions. If the
parent hash table can be the same size as the child hash tables it is
not a problem as the parent will quickly grow to the same size as the
child, eliminating further collisions. But if there are more than one
child the collisions can cause the table to quickly grow until it is
below 10% fillfactor.

The problem is made worse by nodeAgg devolving into building single
entry batches once hash table size exceeds working memory.

To hit the problem an aggregate node without a final function above
multiple other aggregate nodes is needed. Because hash iv is already
initialized using parallel worker number when there is no final fn
the typical cases do not hit this problem. A hash aggregate implementing
DISTINCT above multiple parallel workers with more groups than fit into
memory is needed.

Initializing the hash function based on plan node id decorrelates hash
tables within a plan while still keeping behavior deterministic.

Author: Ants Aasma <ants.aasma@cybertec.at>
Discussion: /messages/by-id/CANwKhkPOZupu3PYQVdkMmYjquYVqG2v8XmCAuuVM9Eu13-Zw3g@mail.gmail.com
---
src/backend/executor/execGrouping.c | 2 +-
src/backend/executor/nodeAgg.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index c107514a85d..21f5e4cabcc 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -246,7 +246,7 @@ BuildTupleHashTable(PlanState *parent,
* underestimated.
*/
if (use_variable_hash_iv)
-		hash_iv = murmurhash32(ParallelWorkerNumber);
+		hash_iv = hash_combine(ParallelWorkerNumber, parent->plan->plan_node_id);

hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable);

I think some of the reasoning from your commit messages needs to be in a
comment somewhere too, otherwise the next person tackling this code will have
a hell of a time.

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7d487a165fa..5c947fd151f 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, double nbuckets)
metacxt,
tuplescxt,
tmpcxt,
-											 DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+											 true);
}

Hm, won't that mean we'll now always use the slower path? I think there are
non-partial cases that could be problematic, but if we do this we probably
should do the optimization of the hash expression stuff I mentioned above.

Greetings,

Andres Freund

#13 Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#12)
Re: Hash aggregate collisions cause excessive spilling

On Sat, 28 Feb 2026 at 22:03, Andres Freund <andres@anarazel.de> wrote:

This problem should only affect cases where multiple hash tables worth
of tuples are trying to fit into one.

Is that entirely true? If we are feeding one hash table from other hash table,
1:1, and the initial size of the target table is wrong, we probably also won't
handle it in the best way.

That could happen with a group by over a group by (with perhaps some joins
above the first group by). Probably not that common, but ...

I guess another case could be a CTAS of a group by query that is then grouped
by again.

With 1:1 tables, if the target table is smaller there is going to be a
ton of collisions which should quite quickly trigger growth via
SH_GROW_MAX_MOVE (150). But once the target table is the same size as
source the growth will stop. I think this ends up being slightly more
efficient in the correlated case because the table will be growing at
10% fill factor, having fewer entries to move around. And the access to
the target hash table will be more correlated over time for better
cache efficiency.

However I think it might be possible to have different hash table size
limits in a single plan. At least setting hash_mem_multiplier on a
function would do it. But I wasn't able to trigger any bad behavior
with this for some reason. I will experiment some more.

I think right now that limits it to hash aggregates with parallel
query. ExecBuildHash32FromAttrs comment says that we should be using
non-zero init_value only when needed for a marginal performance gain. So I
limited the plan node id inclusion to use_variable_hash_iv, but made it
unconditional for aggregates.

That's an argument... I guess we could mostly avoid that though, by just
always combining in EEOP_HASHDATUM_FIRST.

I think that makes it exactly the same as EEOP_HASHDATUM_NEXT32,
making the special op code pointless. I assume David observed some
speedup there to go through the effort to include it.

I used hash_combine to smear together the bits of the two. I think

that should be good enough. Alternatively a xor of murmurhash32 of the
two should be better.

Probably it's good enough, but i'd probably still go for
hash_combine(murmurhash32(ParallelWorkerNumber),
murmurhash32(parent->plan->plan_node_id))
or such.

Used this approach.

There is also a call to ExecBuildHash32FromAttrs from ExecInitSubPlan
that specifies a hash_iv for a hashed sub plan execution.

When you're saying it specifies a hash IV you mean the 0 it passes, not some
more complicated value, right?

Right.

This must match the value BuildTupleHashTable decides on. Worth adding a
comment?

Yes, I think so.

Done.

*/
if (use_variable_hash_iv)
-             hash_iv = murmurhash32(ParallelWorkerNumber);
+             hash_iv = hash_combine(ParallelWorkerNumber, parent->plan->plan_node_id);

hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable);

I think some of the reasoning from your commit messages needs to be in a
comment somewhere too, otherwise the next person tackling this code will have
a hell of a time.

I gave it a shot.

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7d487a165fa..5c947fd151f 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, double nbuckets)
metacxt,
tuplescxt,
tmpcxt,
-                                                                                      DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+                                                                                      true);
}

Hm, won't that mean we'll now always use the slower path? I think there are
non-partial cases that could be problematic, but if we do this we probably
should do the optimization of the hash expression stuff I mentioned above.

If you mean EEOP_HASHDATUM_FIRST, then I think that would just be
removing an optimization. I think getting rid of the quicker growth
and correlated table filling in the 1:1 case might be more significant
than the slightly slower hash calculation. However the downside of the
bad case is pretty nasty. Do you think it's worth the effort of trying
to figure out if we are in a dangerous looking structure?

Decorrelating the hash tables still doesn't fix the terrible edge case
of hash aggregate exceeding memory limit with hash table alone. Just
makes it exceedingly improbable to hit with non-antagonistic
workloads. Would it be a good idea to fix that too? A simple answer
would be to always allow hash aggregate to use some fraction like 25%
of allowed memory for tuples, even if the hash table is too big.

Regards,
Ants Aasma

Attachments:

v2-0001-Decorrelate-nested-hash-tables.patch (text/x-patch, +13/-8)