Fix overflow of nbatch

Started by Vaibhav Jain · 4 months ago · 31 messages
#1 Vaibhav Jain
jainva@google.com
1 attachment(s)

Hi Everyone,

With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is an int, and when multiplied by BLCKSZ it can
easily overflow the int limit. To keep the calculation safe for
current_space, convert nbatch to size_t.

Please find a patch for the same.

Thanks,
Vaibhav

Attachments:

0001-Fix-overflow-of-nbatch.patch (application/octet-stream)
From 9af567f3654190c7ce71a77690c45f6fb29611fb Mon Sep 17 00:00:00 2001
From: Vaibhav Jain <jainva@google.com>
Date: Mon, 22 Sep 2025 18:11:08 +0530
Subject: Fix overflow of nbatch

With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is an int, and when multiplied by BLCKSZ it can
easily overflow the int limit. To keep the calculation safe for
current_space, convert nbatch to size_t.
---
 src/backend/executor/nodeHash.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..c31beb540fd 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -667,7 +667,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	size_t		hash_table_bytes;
 	size_t		bucket_bytes;
 	size_t		max_pointers;
-	int			nbatch = 1;
+	size_t		nbatch = 1;
 	int			nbuckets;
 	double		dbuckets;

#2 David Rowley
dgrowleyml@gmail.com
In reply to: Vaibhav Jain (#1)
Re: Fix overflow of nbatch

On Tue, 23 Sept 2025 at 01:57, Vaibhav Jain <jainva@google.com> wrote:

With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is an int, and when multiplied by BLCKSZ it can
easily overflow the int limit. To keep the calculation safe for
current_space, convert nbatch to size_t.

Thanks for finding and reporting this.

I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:

size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);

I'd rather see this fixed by adding a cast, i.e.: "(size_t) nbatch" on
the above two lines. All that code is new in a1b4f289b, and if you
change the nbatch to a size_t it affects the code that existed before
a1b4f289b, and from looking over that code, it seems to ensure that
nbatch does not overflow int by doing "dbatch = Min(dbatch,
max_pointers);", where max_pointers is constrained by MaxAllocSize /
sizeof(HashJoinTuple).

Also, making nbatch size_t makes the final line of "*numbatches =
nbatch;" somewhat questionable since numbatches is an int pointer.

David

#3 Tomas Vondra
tomas@vondra.me
In reply to: David Rowley (#2)
Re: Fix overflow of nbatch

On 9/22/25 22:45, David Rowley wrote:

On Tue, 23 Sept 2025 at 01:57, Vaibhav Jain <jainva@google.com> wrote:

With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is an int, and when multiplied by BLCKSZ it can
easily overflow the int limit. To keep the calculation safe for
current_space, convert nbatch to size_t.

Thanks for finding and reporting this.

I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:

size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);

Yeah, I failed to notice this part of the formula can overflow.

I'd rather see this fixed by adding a cast, i.e.: "(size_t) nbatch" on
the above two lines. All that code is new in a1b4f289b, and if you
change the nbatch to a size_t it affects the code that existed before
a1b4f289b, and from looking over that code, it seems to ensure that
nbatch does not overflow int by doing "dbatch = Min(dbatch,
max_pointers);", where max_pointers is constrained by MaxAllocSize /
sizeof(HashJoinTuple).

Also, making nbatch size_t makes the final line of "*numbatches =
nbatch;" somewhat questionable since numbatches is an int pointer.

It seems cleaner to add the cast to the formula, if only because of
the assignment to numbatches.

thanks

--
Tomas Vondra

#4 Chao Li
li.evan.chao@gmail.com
In reply to: Vaibhav Jain (#1)
Re: Fix overflow of nbatch

On Sep 22, 2025, at 21:20, Vaibhav Jain <jainva@google.com> wrote:

Hi Everyone,

With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is an int, and when multiplied by BLCKSZ it can
easily overflow the int limit. To keep the calculation safe for
current_space, convert nbatch to size_t.

Please find a patch for the same.

Thanks,
Vaibhav
<0001-Fix-overflow-of-nbatch.patch>

I guess that's because earlier in the function, nbatch is always clamped with:

nbatch = pg_nextpower2_32(Max(2, minbatch));
So in practice nbatch won’t grow very big. But yes, if nbatch reaches, say, 1 million, it will overflow.

A simple program shows that changing nbatch to size_t prevents the overflow:

```
#include <stdio.h>

int main(void)
{
    size_t nbatch = 1000000;    /* 1 million */
    int BLCKSZ = 8192;

    /* 2 * nbatch is evaluated in size_t, so the final product is too */
    size_t result = 2 * nbatch * BLCKSZ;
    printf("%zu\n", result);    /* outputs 16384000000 */
    return 0;
}
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#5 David Rowley
dgrowleyml@gmail.com
In reply to: Chao Li (#4)
Re: Fix overflow of nbatch

On Tue, 23 Sept 2025 at 11:21, Chao Li <li.evan.chao@gmail.com> wrote:

I guess that because earlier in the function, nbatch is always clamped with:

nbatch = pg_nextpower2_32(Max(2, minbatch));

I don't follow which part of that line constitutes clamping. Maybe
you've confused Max with Min?

David

#6 David Rowley
dgrowleyml@gmail.com
In reply to: Tomas Vondra (#3)
Re: Fix overflow of nbatch

On Tue, 23 Sept 2025 at 09:31, Tomas Vondra <tomas@vondra.me> wrote:

On 9/22/25 22:45, David Rowley wrote:

I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:

size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);

Yeah, I failed to notice this part of the formula can overflow.

Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?

David

#7 Tomas Vondra
tomas@vondra.me
In reply to: David Rowley (#6)
Re: Fix overflow of nbatch

On 9/23/25 02:02, David Rowley wrote:

On Tue, 23 Sept 2025 at 09:31, Tomas Vondra <tomas@vondra.me> wrote:

On 9/22/25 22:45, David Rowley wrote:

I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:

size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);

Yeah, I failed to notice this part of the formula can overflow.

Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?

Feel free to fix, but I can take care of it once 18 is out the door.
It's my bug, after all.

BTW ExecHashIncreaseBatchSize needs the same fix, I think.

I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
8KB blocks), which is a lot. But with the balancing logic, it'd also
mean each batch is about ~2GB. So the whole "hash table" would be about
500GB. Possible, but unlikely.

However, looking at the code now, I think the code should adjust the
hash_table_bytes value, not just space_allowed. It's meant to be the
same thing. Will check tomorrow.

thanks

--
Tomas Vondra

#8 David Rowley
dgrowleyml@gmail.com
In reply to: Tomas Vondra (#7)
Re: Fix overflow of nbatch

On Tue, 23 Sept 2025 at 13:01, Tomas Vondra <tomas@vondra.me> wrote:

On 9/23/25 02:02, David Rowley wrote:

Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?

Feel free to fix, but I can take care of it once 18 is out the door.
It's my bug, after all.

BTW ExecHashIncreaseBatchSize needs the same fix, I think.

I think it's probably best you handle this. I didn't notice that one.
You know this area much better than I do.

I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
8KB blocks), which is a lot. But with the balancing logic, it'd also
mean each batch is about ~2GB. So the whole "hash table" would be about
500GB. Possible, but unlikely.

I think no matter how low the chances of overflow are, the code isn't
written the way it was intended to be, so it should simply be fixed to
match that intent, regardless of the likelihood of overflow. Otherwise
we'll just end up with harder-to-hit bugs which could take much longer
to [re]discover. Also, in these terms, what's unlikely today may not
be in the future.

David

#9 Chao Li
li.evan.chao@gmail.com
In reply to: David Rowley (#5)
Re: Fix overflow of nbatch

On Sep 23, 2025, at 07:35, David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 23 Sept 2025 at 11:21, Chao Li <li.evan.chao@gmail.com> wrote:

I guess that because earlier in the function, nbatch is always clamped with:

nbatch = pg_nextpower2_32(Max(2, minbatch));

I don't follow which part of that line constitutes clamping. Maybe
you've confused Max with Min?

David

Sorry for the confusion. I actually meant “minbatch”.

I have traced through the function several times. First, with a normal query (not much data involved),

if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
{

is hard to satisfy, so nbatch will just be 1.

With big data involved it will enter the “if” clause, but minbatch is still unlikely to get very high.

To clarify, I just created a test:

```
evantest=# SET enable_nestloop = off;
SET
evantest=# SET enable_mergejoin = off;
SET
evantest=# SET enable_hashjoin = on;
SET
evantest=# CREATE TEMP TABLE inner_tbl AS
evantest-# SELECT g AS id, repeat('x', 2000) AS filler
evantest-# FROM generate_series(1, 200000) g;
SELECT 200000
evantest=# CREATE TEMP TABLE outer_tbl AS
evantest-# SELECT g AS id FROM generate_series(1, 1000000) g;
SELECT 1000000
evantest=#
evantest=#
evantest=# EXPLAIN ANALYZE
evantest-# SELECT *
evantest-# FROM outer_tbl o
evantest-# JOIN inner_tbl i
evantest-# ON o.id = i.id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=34647.00..1270978635.52 rows=36306020352 width=40) (actual time=353.908..1355.735 rows=200000.00 loops=1)
Hash Cond: (i.id = o.id)
Buffers: local read=54528 dirtied=54425 written=54418, temp read=45853 written=45853
-> Seq Scan on inner_tbl i (cost=0.00..113608.96 rows=6356096 width=36) (actual time=1.132..460.711 rows=200000.00 loops=1)
Buffers: local read=50048 dirtied=50000 written=49993
-> Hash (cost=15904.00..15904.00 rows=1142400 width=4) (actual time=351.280..351.282 rows=1000000.00 loops=1)
Buckets: 262144 Batches: 8 Memory Usage: 6446kB
Buffers: local read=4480 dirtied=4425 written=4425, temp written=2560
-> Seq Scan on outer_tbl o (cost=0.00..15904.00 rows=1142400 width=4) (actual time=0.760..162.229 rows=1000000.00 loops=1)
Buffers: local read=4480 dirtied=4425 written=4425
Planning:
Buffers: shared hit=14
Planning Time: 389649.420 ms
Execution Time: 1362.392 ms
(14 rows)
```

In this test, minbatch is just 64.

But I agree, I never tested with a large amount of data. I don’t actually know how much data it takes for nbatch to reach ~130K (the value that leads to overflow when nbatch is an int).

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#10 Vaibhav Jain
jainva@google.com
In reply to: Chao Li (#9)
1 attachment(s)
Re: Fix overflow of nbatch

Thank you everyone for the reviews, feedback and the test.

Please find attached a new version of the patch.

On Tue, Sep 23, 2025 at 7:11 AM Chao Li <li.evan.chao@gmail.com> wrote:

[...]

Attachments:

0002-Fix-overflow-of-nbatch.patch (application/octet-stream)
From 2805b09ebb5e71908ca34e4c355f1eaccab97ef4 Mon Sep 17 00:00:00 2001
From: Vaibhav Jain <jainva@google.com>
Date: Mon, 22 Sep 2025 18:11:08 +0530
Subject: Fix overflow of nbatch

With a1b4f28, nbatch is multiplied by BLCKSZ to compute current_space.
nbatch is an int, and the product can easily overflow the int limit.
Cast nbatch to size_t in the affected calculations to keep them safe.

---
 src/backend/executor/nodeHash.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..4e0e18314d1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -912,10 +912,10 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	while (nbatch > 0)
 	{
 		/* how much memory are we using with current nbatch value */
-		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+		size_t		current_space = hash_table_bytes + (2 * (size_t) nbatch * BLCKSZ);
 
 		/* how much memory would we use with half the batches */
-		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+		size_t		new_space = hash_table_bytes * 2 + ((size_t) nbatch * BLCKSZ);
 
 		/* If the memory usage would not decrease, we found the optimum. */
 		if (current_space < new_space)
@@ -994,7 +994,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
 	 * How much additional memory would doubling nbatch use? Each batch may
 	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
 	 */
-	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+	size_t		batchSpace = ((size_t) hashtable->nbatch * 2 * BLCKSZ);
 
 	/*
 	 * Compare the new space needed for doubling nbatch and for enlarging the

#11 Tomas Vondra
tomas@vondra.me
In reply to: David Rowley (#8)
Re: Fix overflow of nbatch

On 9/23/25 03:20, David Rowley wrote:

On Tue, 23 Sept 2025 at 13:01, Tomas Vondra <tomas@vondra.me> wrote:

On 9/23/25 02:02, David Rowley wrote:

Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?

Feel free to fix, but I can take care of it once 18 is out the door.
It's my bug, after all.

BTW ExecHashIncreaseBatchSize needs the same fix, I think.

I think it's probably best you handle this. I didn't notice that one.
You know this area much better than I do.

OK, will do.

I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
8KB blocks), which is a lot. But with the balancing logic, it'd also
mean each batch is about ~2GB. So the whole "hash table" would be about
500GB. Possible, but unlikely.

I think no matter how low the chances of overflow are, the code isn't
written the way it was intended to be, so it should simply be fixed to
match that intent, regardless of the likelihood of overflow. Otherwise
we'll just end up with harder-to-hit bugs which could take much longer
to [re]discover. Also, in these terms, what's unlikely today may not
be in the future.

I wasn't disputing the validity of the bug. I was just thinking aloud
about how likely it is to hit.

regards

--
Tomas Vondra

#12 Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#11)
Re: Fix overflow of nbatch

Hi,

I kept looking at this, and unfortunately it seems a bit worse :-(

Fixing the overflow is easy enough - adding the cast does the trick.

But then there's the second issue I mentioned - the loop does not adjust
the hash_table_bytes. It updates the *space_allowed, but that's not what
the current_space/new_space formulas use.

This breaks the "balancing" as the nbatch gets decreased until

nbatch <= (work_mem / BLCKSZ)

while the hash table "space_allowed" is increasing. This may result in
an *increased* memory usage :-(

I also noticed the code does not clamp nbuckets properly as it should.

The question what to do about this. If we got this a week ago, I'd just
probably just revert a1b4f289, and then try again for PG19. After all,
the issue it meant to address is somewhat rare.

But with 18 already stamped ...

I've shared these findings with the rest of the RMT, I'll see what their
thoughts are. Of course, other opinions/suggestions are welcome.

regards

--
Tomas Vondra

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#12)
Re: Fix overflow of nbatch

Tomas Vondra <tomas@vondra.me> writes:

The question is what to do about this. If we'd got this a week ago, I'd
probably just revert a1b4f289 and then try again for PG19. After all,
the issue it was meant to address is somewhat rare.
But with 18 already stamped ...

18.0 is what it is. If this is the most serious bug in it, I'll be
a bit surprised. Take your time, fix it properly.

regards, tom lane

#14 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#13)
Re: Fix overflow of nbatch

On Tue, Sep 23, 2025 at 11:34:56AM -0400, Tom Lane wrote:

18.0 is what it is. If this is the most serious bug in it, I'll be
a bit surprised. Take your time, fix it properly.

+1

--
nathan

#15 Konstantin Knizhnik
knizhnik@garret.ru
In reply to: Tomas Vondra (#12)
Re: Fix overflow of nbatch

On 23/09/2025 6:11 PM, Tomas Vondra wrote:

Hi,

I kept looking at this, and unfortunately it seems a bit worse :-(

Fixing the overflow is easy enough - adding the cast does the trick.

But then there's the second issue I mentioned - the loop does not adjust
the hash_table_bytes. It updates the *space_allowed, but that's not what
the current_space/new_space formulas use.

This breaks the "balancing" as the nbatch gets decreased until

nbatch <= (work_mem / BLCKSZ)

while the hash table "space_allowed" is increasing. This may result in
an *increased* memory usage :-(

I also noticed the code does not clamp nbuckets properly as it should.

The question is what to do about this. If we'd got this a week ago, I'd
probably just revert a1b4f289 and then try again for PG19. After all,
the issue it was meant to address is somewhat rare.

But with 18 already stamped ...

I've shared these findings with the rest of the RMT, I'll see what their
thoughts are. Of course, other opinions/suggestions are welcome.

regards

Hi Tomas,

If you are going to investigate this problem more, can you also look at
the related problem:

/messages/by-id/52b94d5b-a135-489d-9833-2991a69ec623@garret.ru

I proposed the patch but got no feedback.

Best regards,
Konstantin

#16 Tomas Vondra
tomas@vondra.me
In reply to: Konstantin Knizhnik (#15)
Re: Fix overflow of nbatch

On 9/23/25 21:13, Konstantin Knizhnik wrote:

On 23/09/2025 6:11 PM, Tomas Vondra wrote:

[...]

Hi Tomas,

If you are going to investigate this problem more, can you also look at
the related problem:

/messages/by-id/52b94d5b-a135-489d-9833-2991a69ec623%40garret.ru#ebe4151f1d505bbcc32cf93b2e8a1936

I proposed the patch but got no feedback.

Thanks, I'll take a look. I wasn't aware of that thread (or rather I
didn't realize it might have been related to this). And AFAICS the
max_batches business is separate.

But the last bit seems very relevant:

-    while (nbatch > 0)
+    while (nbatch > 0 &&
+       nbuckets * 2 <= max_pointers) /* prevent allocation limit overflow */

I believe this is the missing "nbucket clamping" that I mentioned above.
But I think it needs to work a bit differently. The max_pointers is
calculated from work_mem, but the whole point here is to grow the hash
table beyond that. I think it makes sense to relax that limit, and allow
up to MaxAllocSize, or something like that.

regards

--
Tomas Vondra

#17 Tomas Vondra
tomas@vondra.me
In reply to: Nathan Bossart (#14)
4 attachment(s)
Re: Fix overflow of nbatch

Hi,

Here's a couple draft patches fixing the bug:

- 0001 adds the missing size_t cast, to fix the overflow

- 0002 fixes the balancing, by adjusting the hash table size limit

- 0003 adds the missing overflow protection for nbuckets and the hash
table limit

- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple months, I found it overly verbose / not very clear.
I'm sure it could be improved even further.

0001 and 0002 are pretty trivial, 0003 is a bit bigger, but most of the
code is simply how we clamp nbuckets elsewhere (more or less).

At some point I started wondering whether it'd have been better to use
the algebraic solution posted by James Hunter back in February [1].
It'd not need the loop, but it'd still need all this new overflow
protection etc.

I wanted to make sure the patches actually make it work correctly, so I
created a table with 4B rows:

create table t (a bigint, b text);
insert into t select i, md5(i::text)
from generate_series(1,4000000000) s(i);

and I added this log message at the end of ExecChooseHashTableSize:

elog(WARNING, "wm %d nbatch %d nbucket %d hash %ld total %ld",
work_mem, nbatch, nbuckets, (*space_allowed)/1024,
(*space_allowed + 2 * nbatch * (Size) BLCKSZ)/1024);

and I ran an explain on a self-join

set enable_mergejoin = off;
set max_parallel_workers_per_gather = 0;
set work_mem = '...';

explain select * from t t1 join t t2 on (t1.a = t2.a);

with work_mem set to values between 64kB and 1GB.

On 18.0 I got this:

wm 64 nbatch 8 nbucket 2097152 hash 131072 total 131200
wm 128 nbatch 16 nbucket 4194304 hash 262144 total 262400
wm 256 nbatch 32 nbucket 8388608 hash 524288 total 524800
wm 512 nbatch 64 nbucket 16777216 hash 1048576 total 1049600
wm 1024 nbatch 128 nbucket 33554432 hash 2097152 total 2099200
wm 2048 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
wm 4096 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 8192 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 16384 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

I wanted to know how serious the issue is, compared to what would happen
without the balancing. I disabled the balancing (by skipping the loop),
and then I get this:

wm 64 nbatch 8192 nbucket 2048 hash 128 total 131200
wm 128 nbatch 16384 nbucket 4096 hash 256 total 262400
wm 256 nbatch 32768 nbucket 8192 hash 512 total 524800
wm 512 nbatch 65536 nbucket 16384 hash 1024 total 1049600
wm 1024 nbatch 131072 nbucket 32768 hash 2048 total 2099200
wm 2048 nbatch 131072 nbucket 65536 hash 4096 total 2101248
wm 4096 nbatch 65536 nbucket 131072 hash 8192 total 1056768
wm 8192 nbatch 32768 nbucket 262144 hash 16384 total 540672
wm 16384 nbatch 16384 nbucket 524288 hash 32768 total 294912
wm 32768 nbatch 8192 nbucket 1048576 hash 65536 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

The interesting bit is that the expected total memory usage (the last
number in the log line) is exactly the same as for 18.0 with and without
balancing. IIUC this is due to the "stop" condition using the initial
hash table size. It makes me a bit less worried about this triggering
OOM crashes - it does not improve the behavior, but it doesn't use more
memory than before. Still an embarrassing bug, though.

With the attached patches, this looks like this:

wm 64 nbatch 256 nbucket 65536 hash 4096 total 8192
wm 128 nbatch 512 nbucket 131072 hash 8192 total 16384
wm 256 nbatch 1024 nbucket 262144 hash 16384 total 32768
wm 512 nbatch 2048 nbucket 524288 hash 32768 total 65536
wm 1024 nbatch 4096 nbucket 1048576 hash 65536 total 131072
wm 2048 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 4096 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 8192 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 16384 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

So, this time it actually seems to work correctly and significantly
reduces the memory usage ...

There's one weird thing remaining - if you look at nbatch, it actually
increases for the first couple work_mem steps. That's weird, because
after increasing work_mem we should need *fewer* batches. But this has
nothing to do with the balancing, it happens even with it disabled.

The reason is that when calculating nbatch we do this:

dbatch = Min(dbatch, max_pointers);

and max_pointers is calculated from work_mem (among other things). It's
a bit funny that the logic worries about how many batch pointers we have
and refuses to allow more, but at the same time ignores the BufFiles.

AFAICS it's harmless - we may pick a low number of batches initially, but
then later we'll ramp it up (and the balancing will work too). And if
you choose to run huge hash joins with tiny work_mem, I guess you're in
for the suffering anyway. In any case, it's unrelated to balancing.

regards

[1]: /messages/by-id/CAJVSvF6290rJF2MtgSx_SuT9Kn2amZ_+zecoZYMU+dn3BVVaZg@mail.gmail.com

--
Tomas Vondra

Attachments:

vfix-0004-reword-balancing-comment.patch (text/x-patch; charset=UTF-8)
From d1f0bdc6c4a666fa9e7d4bb1836d656bfec6b4a3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Wed, 24 Sep 2025 22:03:30 +0200
Subject: [PATCH vfix 4/4] reword balancing comment

---
 src/backend/executor/nodeHash.c | 64 +++++++++++----------------------
 1 file changed, 21 insertions(+), 43 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index b2be2091fb2..145c19db8f8 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -851,59 +851,37 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Optimize the total amount of memory consumed by the hash node.
 	 *
-	 * The nbatch calculation above focuses on the size of the in-memory hash
-	 * table, assuming no per-batch overhead. Now adjust the number of batches
-	 * and the size of the hash table to minimize total memory consumed by the
-	 * hash node.
-	 *
-	 * Each batch file has a BLCKSZ buffer, and we may need two files per
-	 * batch (inner and outer side). So with enough batches this can be
-	 * significantly more memory than the hashtable itself.
+	 * The nbatch calculation above focuses on the in-memory hash table,
+	 * assuming no per-batch overhead. But each batch may have two files, each
+	 * with a BLCKSZ buffer. With enough batches these buffers may use
+	 * significantly more memory than the hash table.
 	 *
 	 * The total memory usage may be expressed by this formula:
 	 *
-	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
 	 *
 	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
 	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
-	 * buffers. But for sufficiently large values of inner_rel_bytes value
-	 * there may not be a nbatch value that would make both parts fit into
-	 * hash_table_bytes.
-	 *
-	 * In this case we can't enforce the memory limit - we're going to exceed
-	 * it. We can however minimize the impact and use as little memory as
-	 * possible. (We haven't really enforced it before either, as we simply
-	 * ignored the batch files.)
+	 * buffers. We can't optimize both parts at the same time, and for large
+	 * inner_rel_bytes values there may not be a nbatch value that would keep
+	 * memory usage below the memory limit (work_mem * hash_mem_multiplier).
 	 *
-	 * The formula for total memory usage says that given an inner relation of
-	 * size inner_rel_bytes, we may divide it into an arbitrary number of
-	 * batches. This determines both the size of the in-memory hash table and
-	 * the amount of memory needed for batch files. These two terms work in
-	 * opposite ways - when one decreases, the other increases.
+	 * In this case we're going to exceed the memory limit no matter what. We
+	 * can at least minimize the impact and use as little memory as
+	 * possible. (We haven't enforced it before either; we simply ignored
+	 * the batch files.)
 	 *
-	 * For low nbatch values, the hash table takes most of the memory, but at
-	 * some point the batch files start to dominate. If you combine these two
-	 * terms, the memory consumption (for a fixed size of the inner relation)
-	 * has a u-shape, with a minimum at some nbatch value.
-	 *
-	 * Our goal is to find this nbatch value, minimizing the memory usage. We
-	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
-	 * if it's lower than the current memory usage we know it's better to use
-	 * fewer batches. We repeat this until reducing the number of batches does
-	 * not reduce the memory usage - we found the optimum. We know the optimum
-	 * exists, thanks to the u-shape.
+	 * These two terms in the formula work in opposite ways - one decreases,
+	 * the other increases. The plot has a u-shape, with a minimum at some
+	 * nbatch value. We find the minimum by "walking back" - check if using
+	 * (nbatch/2) batches would lower the memory usage. We stop when the
+	 * memory usage would not improve.
 	 *
 	 * We only want to do this when exceeding the memory limit, not every
-	 * time. The goal is not to minimize memory usage in every case, but to
-	 * minimize the memory usage when we can't stay within the memory limit.
-	 *
-	 * For this reason we only consider reducing the number of batches. We
-	 * could try the opposite direction too, but that would save memory only
-	 * when most of the memory is used by the hash table. And the hash table
-	 * was used for the initial sizing, so we shouldn't be exceeding the
-	 * memory limit too much. We might save memory by using more batches, but
-	 * it would result in spilling more batch files, which does not seem like
-	 * a great trade off.
+	 * time. This is why we only try reducing the number of batches. We could
+	 * try increasing nbatch too, but that would lower memory usage only when
+	 * most of the memory is used by the hash table. It might even force
+	 * batching when not needed. We don't want that.
 	 *
 	 * While growing the hashtable, we also adjust the number of buckets, to
 	 * not have more than one tuple per bucket (load factor 1). We can only do
-- 
2.51.0

vfix-0003-nbuckets-overflow-protection.patchtext/x-patch; charset=UTF-8; name=vfix-0003-nbuckets-overflow-protection.patchDownload
From 7f6d163d99651406d36bef712f6fc87f1e3e801c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Wed, 24 Sep 2025 14:56:14 +0200
Subject: [PATCH vfix 3/4] nbuckets overflow protection

---
 src/backend/executor/nodeHash.c | 44 ++++++++++++++++++++++++++++++---
 1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 972b5b58583..b2be2091fb2 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -925,14 +925,50 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 			break;
 
 		/*
-		 * It's better to use half the batches, so do that and adjust the
-		 * nbucket in the opposite direction, and double the allowance.
+		 * Make sure hash_table_bytes doesn't overflow. If it would, stop and
+		 * use the current nbatch value.
+		 */
+		if (hash_table_bytes > SIZE_MAX / 2)
+			break;
+
+		/*
+		 * It's better to use half the batches, so do that, and adjust the
+		 * allowance in the opposite direction. Then adjust the nbuckets to
+		 * reflect the higher allowance, but be careful about overflow.
 		 */
 		nbatch /= 2;
-		nbuckets *= 2;
 		hash_table_bytes *= 2;
-
 		*space_allowed = (*space_allowed) * 2;
+
+		/*
+		 * Adjust nbuckets, with overflow protection.
+		 *
+		 * Recalculate max_pointers, because the earlier value was based on a
+		 * lower hash_table_bytes, before we adjusted that.
+		 *
+		 * XXX Maybe this is too much, and we need to care only about
+		 * MaxAllocSize? Then we'd not need to recalculate this over and over
+		 * in every loop.
+		 *
+		 * XXX An alternative would be to not grow nbuckets, and stick to the
+		 * originally planned value. The cost is we'll increase the load
+		 * factor (tuples per bucket).
+		 */
+		max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
+		max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
+		/* If max_pointers isn't a power of 2, must round it down to one */
+		max_pointers = pg_prevpower2_size_t(max_pointers);
+
+		/* Also ensure we avoid integer overflow in nbatch and nbuckets */
+		/* (this step is redundant given the current value of MaxAllocSize) */
+		max_pointers = Min(max_pointers, INT_MAX / 2 + 1);
+
+		/* don't break, just cap the number of buckets */
+		dbuckets = (double) nbuckets * 2;
+		dbuckets = Min(dbuckets, max_pointers);
+
+		nbuckets = (int) dbuckets;
+		nbuckets = pg_nextpower2_32(nbuckets);
 	}
 
 	Assert(nbuckets > 0);
-- 
2.51.0

vfix-0002-adjust-hash_table_bytes-for-balancing.patchtext/x-patch; charset=UTF-8; name=vfix-0002-adjust-hash_table_bytes-for-balancing.patchDownload
From a7995c8e6a149e3444e7e48b618a117cda588770 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Wed, 24 Sep 2025 14:32:29 +0200
Subject: [PATCH vfix 2/4] adjust hash_table_bytes for balancing

---
 src/backend/executor/nodeHash.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index ef589b90339..972b5b58583 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -930,6 +930,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		 */
 		nbatch /= 2;
 		nbuckets *= 2;
+		hash_table_bytes *= 2;
 
 		*space_allowed = (*space_allowed) * 2;
 	}
-- 
2.51.0

vfix-0001-fix-size-overflow.patchtext/x-patch; charset=UTF-8; name=vfix-0001-fix-size-overflow.patchDownload
From 4f128c5cbdb11c4d9b1882d39b75338c854ba30e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Wed, 24 Sep 2025 14:30:31 +0200
Subject: [PATCH vfix 1/4] fix size overflow

---
 src/backend/executor/nodeHash.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..ef589b90339 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -913,10 +913,12 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	while (nbatch > 0)
 	{
 		/* how much memory are we using with current nbatch value */
-		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+		size_t		current_space =
+			hash_table_bytes + (2 * nbatch * (size_t) BLCKSZ);
 
 		/* how much memory would we use with half the batches */
-		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+		size_t		new_space =
+			hash_table_bytes * 2 + (nbatch * (size_t) BLCKSZ);
 
 		/* If the memory usage would not decrease, we found the optimum. */
 		if (current_space < new_space)
@@ -995,7 +997,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
 	 * How much additional memory would doubling nbatch use? Each batch may
 	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
 	 */
-	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+	size_t		batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
 
 	/*
 	 * Compare the new space needed for doubling nbatch and for enlarging the
-- 
2.51.0

#18Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#17)
1 attachment(s)
Re: Fix overflow of nbatch

On Wed, Sep 24, 2025 at 7:02 PM Tomas Vondra <tomas@vondra.me> wrote:

Here's a couple draft patches fixing the bug:

- 0001 adds the missing size_t cast, to fix the overflow
- 0002 fixes the balancing, by adjusting the hash table size limit
- 0003 adds the missing overflow protection for nbuckets and the hash
table limit

I've attached an alternative patch idea. It's shorter because I
avoided multiplication using algebraic equivalence. There are two
overflow checks for nbuckets and that max_pointers will be able to be
allocated, but otherwise, it avoids most of the overflow risk by
switching to division.

On the topic of max_pointers, I think there is no need to handle >
MaxAllocSize. We were willing to cope with the original values of
nbatch and nbuckets, we are just trying to optimize that. If doing so
would make us run out of memory for the arrays of pointers to buckets,
just use the last hashtable size that didn't have this problem. That
means we don't have to do any shenanigans to have powers of two for
nbuckets because we don't do clamping.

Also, I think the outer loop needs the condition nbatch > 1 not nbatch
> 0 -- when nbatch is 1, once we divide it by 2, it would end up as 0.

I'm still waiting for the 4billion row table to be created on my
machine, so I haven't verified that I get the same results as you yet.

- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple months, I found it overly verbose / not very clear.
I'm sure it could be improved even further.

My attached patch does build on your revised wording.

- Melanie

Attachments:

alt-Fix-overflow-risk-in-hashtable-size-calculation.patchtext/x-patch; charset=US-ASCII; name=alt-Fix-overflow-risk-in-hashtable-size-calculation.patchDownload
From 477fad740ca9abcfcb57440f3f9d0d5474d61601 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 7 Oct 2025 13:21:14 -0400
Subject: [PATCH v1] Fix overflow risk in hashtable size calculation

---
 src/backend/executor/nodeHash.c | 108 ++++++++++++++------------------
 1 file changed, 47 insertions(+), 61 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..70723aeb558 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -850,92 +850,78 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Optimize the total amount of memory consumed by the hash node.
 	 *
-	 * The nbatch calculation above focuses on the size of the in-memory hash
-	 * table, assuming no per-batch overhead. Now adjust the number of batches
-	 * and the size of the hash table to minimize total memory consumed by the
-	 * hash node.
-	 *
-	 * Each batch file has a BLCKSZ buffer, and we may need two files per
-	 * batch (inner and outer side). So with enough batches this can be
-	 * significantly more memory than the hashtable itself.
+	 * The nbatch calculation above focuses on the in-memory hash table,
+	 * assuming no per-batch overhead. But each batch may have two files, each
+	 * with a BLCKSZ buffer. With enough batches these buffers may use
+	 * significantly more memory than the hash table.
 	 *
 	 * The total memory usage may be expressed by this formula:
 	 *
-	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
 	 *
 	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
 	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
-	 * buffers. But for sufficiently large values of inner_rel_bytes value
-	 * there may not be a nbatch value that would make both parts fit into
-	 * hash_table_bytes.
-	 *
-	 * In this case we can't enforce the memory limit - we're going to exceed
-	 * it. We can however minimize the impact and use as little memory as
-	 * possible. (We haven't really enforced it before either, as we simply
-	 * ignored the batch files.)
+	 * buffers.
 	 *
-	 * The formula for total memory usage says that given an inner relation of
-	 * size inner_rel_bytes, we may divide it into an arbitrary number of
-	 * batches. This determines both the size of the in-memory hash table and
-	 * the amount of memory needed for batch files. These two terms work in
-	 * opposite ways - when one decreases, the other increases.
+	 * For very large inner_rel_bytes, there may be no nbatch that keeps total
+	 * memory usage under the budget (work_mem * hash_mem_multiplier). In that
+	 * case, we choose nbatch to minimize total memory consumption across both
+	 * the hashtable and file buffers.
 	 *
-	 * For low nbatch values, the hash table takes most of the memory, but at
-	 * some point the batch files start to dominate. If you combine these two
-	 * terms, the memory consumption (for a fixed size of the inner relation)
-	 * has a u-shape, with a minimum at some nbatch value.
+	 * As we increase the size of the hashtable and decrease the number of
+	 * batches, the total memory usage follows a U-shaped curve. We find the
+	 * minimum nbatch by "walking back" -- checking if halving nbatch would
+	 * lower total memory usage. We stop when it no longer helps.
 	 *
-	 * Our goal is to find this nbatch value, minimizing the memory usage. We
-	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
-	 * if it's lower than the current memory usage we know it's better to use
-	 * fewer batches. We repeat this until reducing the number of batches does
-	 * not reduce the memory usage - we found the optimum. We know the optimum
-	 * exists, thanks to the u-shape.
+	 * We only want to do this when we are already over the memory budget. We
+	 * don't explore increasing nbatch since that could force batching when it
+	 * is not needed.
 	 *
-	 * We only want to do this when exceeding the memory limit, not every
-	 * time. The goal is not to minimize memory usage in every case, but to
-	 * minimize the memory usage when we can't stay within the memory limit.
+	 * While growing the hashtable, we also adjust the number of buckets to
+	 * maintain a load factor of NTUP_PER_BUCKET while squeezing tuples back
+	 * from batches into the hashtable.
 	 *
-	 * For this reason we only consider reducing the number of batches. We
-	 * could try the opposite direction too, but that would save memory only
-	 * when most of the memory is used by the hash table. And the hash table
-	 * was used for the initial sizing, so we shouldn't be exceeding the
-	 * memory limit too much. We might save memory by using more batches, but
-	 * it would result in spilling more batch files, which does not seem like
-	 * a great trade off.
-	 *
-	 * While growing the hashtable, we also adjust the number of buckets, to
-	 * not have more than one tuple per bucket (load factor 1). We can only do
-	 * this during the initial sizing - once we start building the hash,
-	 * nbucket is fixed.
+	 * Note that we can only change nbuckets during initial hashtable sizing.
+	 * Once we start building the hash, nbuckets is fixed.
 	 */
-	while (nbatch > 0)
+	while (nbatch > 1)
 	{
-		/* how much memory are we using with current nbatch value */
-		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
-
-		/* how much memory would we use with half the batches */
-		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+		/* Ensure that nbuckets * 2 doesn't overflow an int */
+		if (nbuckets > INT_MAX / 2)
+			break;
 
-		/* If the memory usage would not decrease, we found the optimum. */
-		if (current_space < new_space)
+		/*
+		 * Ensure that hash_table_bytes * 2 doesn't exceed MaxAllocSize /
+		 * sizeof(HashJoinTuple)
+		 */
+		if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
 			break;
 
 		/*
-		 * It's better to use half the batches, so do that and adjust the
-		 * nbucket in the opposite direction, and double the allowance.
+		 * This is the same as:
+		 *
+		 * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
+		 * nbatch * BLCKSZ
+		 *
+		 * avoiding intermediate overflow.
+		 *
+		 * which is to say: will halving the number of batches and doubling
+		 * the size of the hashtable reduce overall memory usage?
 		 */
-		nbatch /= 2;
+		if (nbatch / 4 < hash_table_bytes / BLCKSZ)
+			break;
+
 		nbuckets *= 2;
+		hash_table_bytes *= 2;
 
-		*space_allowed = (*space_allowed) * 2;
+		nbatch /= 2;
 	}
 
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);
-
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
+	*space_allowed = hash_table_bytes;
 }
 
 
@@ -994,7 +980,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
 	 * How much additional memory would doubling nbatch use? Each batch may
 	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
 	 */
-	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+	size_t		batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
 
 	/*
 	 * Compare the new space needed for doubling nbatch and for enlarging the
-- 
2.43.0

#19Tomas Vondra
tomas@vondra.me
In reply to: Melanie Plageman (#18)
Re: Fix overflow of nbatch

On 10/7/25 23:10, Melanie Plageman wrote:

On Wed, Sep 24, 2025 at 7:02 PM Tomas Vondra <tomas@vondra.me> wrote:

Here's a couple draft patches fixing the bug:

- 0001 adds the missing size_t cast, to fix the overflow
- 0002 fixes the balancing, by adjusting the hash table size limit
- 0003 adds the missing overflow protection for nbuckets and the hash
table limit

I've attached an alternative patch idea. It's shorter because I
avoided multiplication using algebraic equivalence. There are two
overflow checks for nbuckets and that max_pointers will be able to be
allocated, but otherwise, it avoids most of the overflow risk by
switching to division.

Thanks for looking!

On the topic of max_pointers, I think there is no need to handle >
MaxAllocSize. We were willing to cope with the original values of
nbatch and nbuckets, we are just trying to optimize that. If doing so
would make us run out of memory for the arrays of pointers to buckets,
just use the last hashtable size that didn't have this problem. That
means we don't have to do any shenanigans to have powers of two for
nbuckets because we don't do clamping.

Hmm, so if I understand correctly you suggest stopping when nbuckets gets
too high, while my code allowed reducing nbatch further (and just
capped nbuckets). I'm fine with this change if it makes the code
simpler. It means we allow ~130M buckets, which seems rather unlikely
to be a problem in practice. That means 1GB for buckets alone, and
tuples tend to be a multiple of that. With 4GB total, that's ~256k
batches. And multiplied by the 130M that would be 3.5e13 tuples ...

However, I don't understand why the condition bothers about INT_MAX?

/* Ensure that nbuckets * 2 doesn't overflow an int */
if (nbuckets > INT_MAX / 2)
break;

AFAIK what matters is whether the buckets fit into MaxAllocSize. But
INT_MAX (or even INT_MAX/2) would allocate way more than that. The
original code (copied from an earlier part of the function) points out
the INT_MAX check is redundant, given the current MaxAllocSize value.

It's however true that if we double nbuckets and nbatch at the same
time, we don't need to bother doing this:

max_pointers = hash_table_bytes / sizeof(HashJoinTuple);

because we can never break. Although, is that really true? When picking
the initial nbuckets value we do this:

nbuckets = Max(nbuckets, 1024);

What if this increased the nbuckets (it'd require extremely low work_mem
of course)? Could it lead to unexpectedly high nbuckets in the loop?

Similarly, why does this check care about number of buckets?

if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
(hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
size of the hash table in bytes, not in number of items. Also, see how
get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
continue doing that.

I like the idea of simplifying the stop condition, instead of
calculating the old/new size, and comparing those. But is this actually
correct?

if (nbatch / 4 < hash_table_bytes / BLCKSZ)
break;

If we start with the "full" condition

hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)

subtract hash_bytes from both sides

(2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)

subtract (nbatch * BLCKSZ)

nbatch * BLCKSZ < hash_bytes

that gives us

nbatch < (hash_bytes / BLCKSZ)

So where did the /4 come from? Also, maybe it'd be easier to just do

(nbatch * BLCKSZ) < hash_bytes

i.e. without the division.

Also, I think the outer loop needs the condition nbatch > 1 not nbatch
> 0 -- when nbatch is 1, once we divide it by 2, it would end up as 0.

Good point. I was wondering why we're not seeing any failures because of
this (e.g. from tests using tiny amounts of data, with a single batch).
But the stop condition triggers immediately, because the two batch files
use much less than the minimum work_mem value (even without the multiplier).

So it's benign, but let's not do that anyway.

I'm still waiting for the 4billion row table to be created on my
machine, so I haven't verified that I get the same results as you yet.

- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple months, I found it overly verbose / not very clear.
I'm sure it could be improved even further.

My attached patch does build on your revised wording.

Thanks. I'll re-read the comment tomorrow.

regards

--
Tomas Vondra

#20Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#19)
1 attachment(s)
Re: Fix overflow of nbatch

On Tue, Oct 7, 2025 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:

On 10/7/25 23:10, Melanie Plageman wrote:

Hmm, so if I understand correctly you suggest stop when nbuckets gets
too high, while my code allowed reducing nbatches further (and just
capped nbatches). I'm fine with this change, if it makes the code
simpler, that means we allow ~130M buckets, which seems rather unlikely
to be a problem in practice. That means 1GB for buckets alone, and
tuples tend to be a multiple of that. With 4GB total, that's ~256k
batches. And multiplied by the 130M that would be 3.5e13 tuples ...

However, I don't understand why the condition bothers about INT_MAX?

/* Ensure that nbuckets * 2 doesn't overflow an int */
if (nbuckets > INT_MAX / 2)
break;

You're right. This can just be
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

It's however true that if we double nbuckets and nbatch at the same
time, we don't need to bother doing this:

max_pointers = hash_table_bytes / sizeof(HashJoinTuple);

because we can never break. Although, is that really true? When picking
the initial nbuckets value we do this:

nbuckets = Max(nbuckets, 1024);

What if this increased the nbuckets (it'd require extremely low work_mem
of course)? Could it lead to unexpectedly high nbuckets in the loop?

If we are worried about nbuckets exceeding what can be allocated, then
I think the proposed condition above takes care of that

if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

As for whether or not bumping nbuckets to 1024 before the loop means
we can have very high values with low work_mem: it seems like even
with the minimum work_mem, the number of buckets is larger than that.
When could we hit this? Maybe if the number of skew buckets is very
large?

Similarly, why does this check care about number of buckets?

if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
(hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
size of the hash table in bytes, not in number of items. Also, see how
get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
continue doing that.

Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:

hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
sizeof(HashJoinTuple) / 2

but I don't think we need this because nbuckets should always be
bigger than hash_table_bytes / sizeof(HashJoinTuple) since to start
out with, it was clamped to Max(1024, hash_table_bytes /
sizeof(HashJoinTuple)).

The only exception would be if MaxAllocSize / sizeof(HashJoinTuple)
was smaller than hash_table_bytes / sizeof(HashJoinTuple) in which
case, checking nbuckets > MaxAllocSize / sizeof(HashJoinTuple) should
cover that anyway.

I like the idea of simplifying the stop condition, instead of
calculating the old/new size, and comparing those. But is this actually
correct?

if (nbatch / 4 < hash_table_bytes / BLCKSZ)
break;

If we start with the "full" condition

hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)

subtract hash_bytes from both sides

(2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)

subtract (nbatch * BLCKSZ)

nbatch * BLCKSZ < hash_bytes

that gives us

nbatch < (hash_bytes / BLCKSZ)

So where did the /4 come from?

Yes, I made the mistake of doubling the batches and hash_table_bytes
again, forgetting that the original formula was already comparing the
hypothetical space to the current space.

I think it should be
if (nbatch < hash_table_bytes / BLCKSZ)
as you say

Also, maybe it'd be easier to just do
(nbatch * BLCKSZ) < hash_bytes

i.e. without the division.

I prefer the division to avoid as many potentially overflow-causing
operations as possible. Otherwise we would have to check that nbatch *
BLCKSZ doesn't overflow first.

I'm still waiting for the 4billion row table to be created on my
machine, so I haven't verified that I get the same results as you yet.

I have updated my patch to fix the mistakes above. I also noticed then
that I wasn't doubling space_allowed in the loop but instead setting
it to hash_table_bytes at the end. This doesn't produce a power of 2
because we subtract skew_mcvs from the hash_table_bytes. So, we have
to keep using space_allowed if we want a power of 2 in the end.

I've changed my patch to do this, but this made me wonder if we want
to be doing this or instead take hash_table_bytes at the end and round
it up to a power of 2 and set space_allowed to that. If the skew
hashtable is large, we may be allocating way more space_allowed than
we need for new hash_table_bytes + skew hashtable buckets.

Oh, and I get the same logging output results as your patch with attached v2.

- Melanie

Attachments:

v2-0001-Fix-overflow-risk-in-hashtable-size-calculation.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Fix-overflow-risk-in-hashtable-size-calculation.patchDownload
From 716b7c939019294c2e006e55da44180f61072fe2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 7 Oct 2025 13:21:14 -0400
Subject: [PATCH v2] Fix overflow risk in hashtable size calculation

ci-os-only:
---
 src/backend/executor/nodeHash.c | 115 +++++++++++++++-----------------
 1 file changed, 55 insertions(+), 60 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..e8be61e40ed 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -850,90 +850,85 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Optimize the total amount of memory consumed by the hash node.
 	 *
-	 * The nbatch calculation above focuses on the size of the in-memory hash
-	 * table, assuming no per-batch overhead. Now adjust the number of batches
-	 * and the size of the hash table to minimize total memory consumed by the
-	 * hash node.
-	 *
-	 * Each batch file has a BLCKSZ buffer, and we may need two files per
-	 * batch (inner and outer side). So with enough batches this can be
-	 * significantly more memory than the hashtable itself.
+	 * The nbatch calculation above focuses on the in-memory hash table,
+	 * assuming no per-batch overhead. But each batch may have two files, each
+	 * with a BLCKSZ buffer. With enough batches these buffers may use
+	 * significantly more memory than the hash table.
 	 *
 	 * The total memory usage may be expressed by this formula:
 	 *
-	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
 	 *
 	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
 	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
-	 * buffers. But for sufficiently large values of inner_rel_bytes value
-	 * there may not be a nbatch value that would make both parts fit into
-	 * hash_table_bytes.
-	 *
-	 * In this case we can't enforce the memory limit - we're going to exceed
-	 * it. We can however minimize the impact and use as little memory as
-	 * possible. (We haven't really enforced it before either, as we simply
-	 * ignored the batch files.)
-	 *
-	 * The formula for total memory usage says that given an inner relation of
-	 * size inner_rel_bytes, we may divide it into an arbitrary number of
-	 * batches. This determines both the size of the in-memory hash table and
-	 * the amount of memory needed for batch files. These two terms work in
-	 * opposite ways - when one decreases, the other increases.
+	 * buffers.
 	 *
-	 * For low nbatch values, the hash table takes most of the memory, but at
-	 * some point the batch files start to dominate. If you combine these two
-	 * terms, the memory consumption (for a fixed size of the inner relation)
-	 * has a u-shape, with a minimum at some nbatch value.
+	 * For very large inner_rel_bytes, there may be no nbatch that keeps total
+	 * memory usage under the budget (work_mem * hash_mem_multiplier). In that
+	 * case, we choose nbatch to minimize total memory consumption across both
+	 * the hashtable and file buffers.
 	 *
-	 * Our goal is to find this nbatch value, minimizing the memory usage. We
-	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
-	 * if it's lower than the current memory usage we know it's better to use
-	 * fewer batches. We repeat this until reducing the number of batches does
-	 * not reduce the memory usage - we found the optimum. We know the optimum
-	 * exists, thanks to the u-shape.
+	 * As we increase the size of the hashtable and decrease the number of
+	 * batches, the total memory usage follows a U-shaped curve. We find the
+	 * minimum nbatch by "walking back" -- checking if halving nbatch would
+	 * lower total memory usage. We stop when it no longer helps.
 	 *
-	 * We only want to do this when exceeding the memory limit, not every
-	 * time. The goal is not to minimize memory usage in every case, but to
-	 * minimize the memory usage when we can't stay within the memory limit.
+	 * We only want to do this when we are already over the memory budget. We
+	 * don't explore increasing nbatch since that could force batching when it
+	 * is not needed.
 	 *
-	 * For this reason we only consider reducing the number of batches. We
-	 * could try the opposite direction too, but that would save memory only
-	 * when most of the memory is used by the hash table. And the hash table
-	 * was used for the initial sizing, so we shouldn't be exceeding the
-	 * memory limit too much. We might save memory by using more batches, but
-	 * it would result in spilling more batch files, which does not seem like
-	 * a great trade off.
+	 * While growing the hashtable, we also adjust the number of buckets to
+	 * maintain a load factor of NTUP_PER_BUCKET while squeezing tuples back
+	 * from batches into the hashtable.
 	 *
-	 * While growing the hashtable, we also adjust the number of buckets, to
-	 * not have more than one tuple per bucket (load factor 1). We can only do
-	 * this during the initial sizing - once we start building the hash,
-	 * nbucket is fixed.
+	 * Note that we can only change nbuckets during initial hashtable sizing.
+	 * Once we start building the hash, nbuckets is fixed.
 	 */
-	while (nbatch > 0)
+	while (nbatch > 1)
 	{
-		/* how much memory are we using with current nbatch value */
-		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+		/*
+		 * Ensure that nbuckets can be allocated. nbuckets may have been
+		 * increased above hash_table_size / sizeof(HashJoinTuple) if work_mem
+		 * was very small, so checking if MaxAllocSize is enough for the
+		 * buckets is the safest check.
+		 */
+		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
+			break;
 
-		/* how much memory would we use with half the batches */
-		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+		/*
+		 * We shouldn't overflow a size_t if we break before hash_table_bytes
+		 * would exceed MaxAllocSize. We use space_allowed because it will be
+		 * larger than hash_table_bytes when there is a skew hashtable.
+		 */
+		Assert((*space_allowed) < SIZE_MAX / 2);
 
-		/* If the memory usage would not decrease, we found the optimum. */
-		if (current_space < new_space)
+		/*
+		 * This is the same as:
+		 *
+		 * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
+		 * nbatch * BLCKSZ
+		 *
+		 * avoiding intermediate overflow.
+		 *
+		 * which is to say: will halving the number of batches and doubling
+		 * the size of the hashtable reduce overall memory usage?
+		 */
+		if (nbatch < hash_table_bytes / BLCKSZ)
 			break;
 
 		/*
-		 * It's better to use half the batches, so do that and adjust the
-		 * nbucket in the opposite direction, and double the allowance.
+		 * MaxAllocSize is sufficiently small that we are not worried about
+		 * overflowing nbuckets.
 		 */
-		nbatch /= 2;
 		nbuckets *= 2;
-
+		hash_table_bytes *= 2;
 		*space_allowed = (*space_allowed) * 2;
+
+		nbatch /= 2;
 	}
 
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);
-
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
 }
@@ -994,7 +989,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
 	 * How much additional memory would doubling nbatch use? Each batch may
 	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
 	 */
-	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+	size_t		batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
 
 	/*
 	 * Compare the new space needed for doubling nbatch and for enlarging the
-- 
2.43.0

#21Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#20)
Re: Fix overflow of nbatch

On Wed, Oct 8, 2025 at 1:37 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I have updated my patch to fix the mistakes above. I also noticed then
that I wasn't doubling space_allowed in the loop but instead setting
it to hash_table_bytes at the end. This doesn't produce a power of 2
because we subtract skew_mcvs from the hash_table_bytes. So, we have
to keep using space_allowed if we want a power of 2 in the end.

I've changed my patch to do this, but this made me wonder if we want
to be doing this or instead take hash_table_bytes at the end and round
it up to a power of 2 and set space_allowed to that. If the skew
hashtable is large, we may be allocating way more space_allowed than
we need for new hash_table_bytes + skew hashtable buckets.

Oh wait, that doesn't make sense because each batch could have a skew hashtable.

- Melanie

#22Tomas Vondra
tomas@vondra.me
In reply to: Melanie Plageman (#20)
Re: Fix overflow of nbatch

On 10/8/25 19:37, Melanie Plageman wrote:

On Tue, Oct 7, 2025 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:

On 10/7/25 23:10, Melanie Plageman wrote:

Hmm, so if I understand correctly you suggest stop when nbuckets gets
too high, while my code allowed reducing nbatches further (and just
capped nbatches). I'm fine with this change, if it makes the code
simpler, that means we allow ~130M buckets, which seems rather unlikely
to be a problem in practice. That means 1GB for buckets alone, and
tuples tend to be a multiple of that. With 4GB total, that's ~256k
batches. And multiplied by the 130M that would be 3.5e13 tuples ...

However, I don't understand why the condition bothers about INT_MAX?

/* Ensure that nbuckets * 2 doesn't overflow an int */
if (nbuckets > INT_MAX / 2)
break;

You're right. This can just be
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

It's however true that if we double nbuckets and nbatch at the same
time, we don't need to bother doing this:

max_pointers = hash_table_bytes / sizeof(HashJoinTuple);

because we can never break. Although, is that really true? When picking
the initial nbuckets value we do this:

nbuckets = Max(nbuckets, 1024);

What if this increased the nbuckets (it'd require extremely low work_mem
of course)? Could it lead to unexpectedly high nbuckets in the loop?

If we are worried about nbuckets exceeding what can be allocated, then
I think the proposed condition above takes care of that

if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

As for whether or not bumping nbuckets to 1024 before the loop means
we can have very high values with low work_mem: it seems like even
with the minimum work_mem, the number of buckets is larger than that.
When could we hit this? Maybe if the number of skew buckets is very
large?

Similarly, why does this check care about number of buckets?

if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
(hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
size of the hash table in bytes, not in number of items. Also, see how
get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
continue doing that.

Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:

hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
sizeof(HashJoinTuple) / 2

But the hash table is not allocated as a single chunk of memory, so I
think MaxAllocSize would be the wrong thing to use here, no? Hash table
is a collection of tuples (allocated one by one), that's why
get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
nbuckets, because that indeed is allocated as a contiguous array, ofc.

Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
we get

hash_table_bytes > MaxAllocSize / 2

but again, that doesn't make much sense - the hash table can be larger,
it's not a single palloc.

but I don't think we need this because nbuckets should always be
bigger than hash_table_bytes / sizeof(HashJoinTuple) since to start
out with, it was clamped to Max(1024, hash_table_bytes /
sizeof(HashJoinTuple)).

The only exception would be if MaxAllocSize / sizeof(HashJoinTuple)
was smaller than hash_table_bytes / sizeof(HashJoinTuple) in which
case, checking nbuckets > MaxAllocSize / sizeof(HashJoinTuple) should
cover that anyway.

I'm confused about what this check was meant to do. The check said this

/*
* Ensure that hash_table_bytes * 2 doesn't exceed MaxAllocSize /
* sizeof(HashJoinTuple)
*/
if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

And I think it should be just

if (hash_table_bytes > SIZE_MAX / 2)
break;

as a protection against hash_table_bytes overflowing SIZE_MAX (which on
64-bits seems implausible, but could happen on 32-bit builds).

Also, how is this related to nbuckets?

I like the idea of simplifying the stop condition, instead of
calculating the old/new size, and comparing those. But is this actually
correct?

if (nbatch / 4 < hash_table_bytes / BLCKSZ)
break;

If we start with the "full" condition

hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)

subtract hash_bytes from both sides

(2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)

subtract (nbatch * BLCKSZ)

nbatch * BLCKSZ < hash_bytes

that gives us

nbatch < (hash_bytes / BLCKSZ)

So where did the /4 come from?

Yes, I made the mistake of doubling the batches and hash_table_bytes
again, forgetting that the original formula was already comparing the
hypothetical space to the current space.

I think it should be
if (nbatch < hash_table_bytes / BLCKSZ)
as you say

Also, maybe it'd be easier to just do
(nbatch * BLCKSZ) < hash_bytes

i.e. without the division.

I prefer the division to avoid as many potentially overflow-causing
operations as possible. Otherwise we would have to check that nbatch *
BLCKSZ doesn't overflow first.

Ah, I forgot about overflow again! Which is ironic, because this whole
thing is about overflows.

I really wish we had an infinite-precision-int ;-)

I'm still waiting for the 4-billion row table to be created on my
machine, so I haven't verified that I get the same results as you yet.

I have updated my patch to fix the mistakes above. I also noticed then
that I wasn't doubling space_allowed in the loop but instead setting
it to hash_table_bytes at the end. This doesn't produce a power of 2
because we subtract skew_mcvs from the hash_table_bytes. So, we have
to keep using space_allowed if we want a power of 2 in the end.

I've changed my patch to do this, but this made me wonder if we want
to be doing this or instead take hash_table_bytes at the end and round
it up to a power of 2 and set space_allowed to that. If the skew
hashtable is large, we may be allocating way more space_allowed than
we need for new hash_table_bytes + skew hashtable buckets.

Oh, and I get the same logging output results as your patch with attached v2.

Cool, in the worst case we're both wrong in the same way ;-)

regards

--
Tomas Vondra

#23Tomas Vondra
tomas@vondra.me
In reply to: Melanie Plageman (#21)
Re: Fix overflow of nbatch

On 10/8/25 21:16, Melanie Plageman wrote:

On Wed, Oct 8, 2025 at 1:37 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I have updated my patch to fix the mistakes above. I also noticed then
that I wasn't doubling space_allowed in the loop but instead setting
it to hash_table_bytes at the end. This doesn't produce a power of 2
because we subtract skew_mcvs from the hash_table_bytes. So, we have
to keep using space_allowed if we want a power of 2 in the end.

I've changed my patch to do this, but this made me wonder if we want
to be doing this or instead take hash_table_bytes at the end and round
it up to a power of 2 and set space_allowed to that. If the skew
hashtable is large, we may be allocating way more space_allowed than
we need for new hash_table_bytes + skew hashtable buckets.

I don't think there's any promise of hash_table_bytes being a power of 2.
You can make hash_table_bytes an almost arbitrary value by setting
work_mem and hash_mem_multiplier. Or am I missing something?

But you're right hash_table_bytes and space_allowed may not be equal if
useskew=true. So setting space_allowed to hash_table_bytes at the end
does not seem right. I think we don't actually need hash_table_bytes at
this point, we can just ignore it, and use/double *space_allowed.

I kept using hash_table_bytes mostly because it didn't require the
pointer dereferencing, but I failed to consider the useskew=true thing.

However, this means there's probably a bug - the loop should probably
double num_skew_mcvs too. We simply reserve SKEW_HASH_MEM_PERCENT of
space_allowed for skew hashtable, so should we adjust it the same way?

Oh wait, that doesn't make sense because each batch could have a skew hashtable.

Not sure I understand. Is this the same issue I just described?

regards

--
Tomas Vondra

#24Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#22)
1 attachment(s)
Re: Fix overflow of nbatch

On Wed, Oct 8, 2025 at 4:46 PM Tomas Vondra <tomas@vondra.me> wrote:

On 10/8/25 19:37, Melanie Plageman wrote:

Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:

hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
sizeof(HashJoinTuple) / 2

But the hash table is not allocated as a single chunk of memory, so I
think MaxAllocSize would be the wrong thing to use here, no? Hash table
is a collection of tuples (allocated one by one), that's why
get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
nbuckets, because that indeed is allocated as a contiguous array, ofc.

Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
we get

hash_table_bytes > MaxAllocSize / 2

but again, that doesn't make much sense - the hash table can be larger,
it's not a single palloc.

It came from the earlier clamping of nbuckets:

max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
dbuckets = Min(dbuckets, max_pointers);

I don't really get why this divides hash_table_bytes by
sizeof(HashJoinTuple) since, as you say, hash_table_bytes is supposed
to accommodate both the bucket_bytes and inner_rel_bytes.

And I think it should be just

if (hash_table_bytes > SIZE_MAX / 2)
break;

as a protection against hash_table_bytes overflowing SIZE_MAX (which on
64-bits seems implausible, but could happen on 32-bit builds).

That's roughly the check I ended up with -- except I used
space_allowed because it will be larger than hash_table_bytes if there
is a skew hashtable. I've updated the comment and such in attached v3.

- Melanie

Attachments:

v3-0001-Fix-overflow-risk-in-hashtable-size-calculation.patch (text/x-patch)
From 91cd15ac95e7a5943213bf6d88c08f36b4c0c17b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 7 Oct 2025 13:21:14 -0400
Subject: [PATCH v3] Fix overflow risk in hashtable size calculation

ci-os-only:
---
 src/backend/executor/nodeHash.c | 115 +++++++++++++++-----------------
 1 file changed, 55 insertions(+), 60 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..0f815674c9f 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -850,90 +850,85 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Optimize the total amount of memory consumed by the hash node.
 	 *
-	 * The nbatch calculation above focuses on the size of the in-memory hash
-	 * table, assuming no per-batch overhead. Now adjust the number of batches
-	 * and the size of the hash table to minimize total memory consumed by the
-	 * hash node.
-	 *
-	 * Each batch file has a BLCKSZ buffer, and we may need two files per
-	 * batch (inner and outer side). So with enough batches this can be
-	 * significantly more memory than the hashtable itself.
+	 * The nbatch calculation above focuses on the in-memory hash table,
+	 * assuming no per-batch overhead. But each batch may have two files, each
+	 * with a BLCKSZ buffer. With enough batches these buffers may use
+	 * significantly more memory than the hash table.
 	 *
 	 * The total memory usage may be expressed by this formula:
 	 *
-	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
 	 *
 	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
 	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
-	 * buffers. But for sufficiently large values of inner_rel_bytes value
-	 * there may not be a nbatch value that would make both parts fit into
-	 * hash_table_bytes.
-	 *
-	 * In this case we can't enforce the memory limit - we're going to exceed
-	 * it. We can however minimize the impact and use as little memory as
-	 * possible. (We haven't really enforced it before either, as we simply
-	 * ignored the batch files.)
-	 *
-	 * The formula for total memory usage says that given an inner relation of
-	 * size inner_rel_bytes, we may divide it into an arbitrary number of
-	 * batches. This determines both the size of the in-memory hash table and
-	 * the amount of memory needed for batch files. These two terms work in
-	 * opposite ways - when one decreases, the other increases.
+	 * buffers.
 	 *
-	 * For low nbatch values, the hash table takes most of the memory, but at
-	 * some point the batch files start to dominate. If you combine these two
-	 * terms, the memory consumption (for a fixed size of the inner relation)
-	 * has a u-shape, with a minimum at some nbatch value.
+	 * For very large inner_rel_bytes, there may be no nbatch that keeps total
+	 * memory usage under the budget (work_mem * hash_mem_multiplier). In that
+	 * case, we choose nbatch to minimize total memory consumption across both
+	 * the hashtable and file buffers.
 	 *
-	 * Our goal is to find this nbatch value, minimizing the memory usage. We
-	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
-	 * if it's lower than the current memory usage we know it's better to use
-	 * fewer batches. We repeat this until reducing the number of batches does
-	 * not reduce the memory usage - we found the optimum. We know the optimum
-	 * exists, thanks to the u-shape.
+	 * As we increase the size of the hashtable and decrease the number of
+	 * batches, the total memory usage follows a U-shaped curve. We find the
+	 * minimum nbatch by "walking back" -- checking if halving nbatch would
+	 * lower total memory usage. We stop when it no longer helps.
 	 *
-	 * We only want to do this when exceeding the memory limit, not every
-	 * time. The goal is not to minimize memory usage in every case, but to
-	 * minimize the memory usage when we can't stay within the memory limit.
+	 * We only want to do this when we are already over the memory budget. We
+	 * don't explore increasing nbatch since that could force batching when it
+	 * is not needed.
 	 *
-	 * For this reason we only consider reducing the number of batches. We
-	 * could try the opposite direction too, but that would save memory only
-	 * when most of the memory is used by the hash table. And the hash table
-	 * was used for the initial sizing, so we shouldn't be exceeding the
-	 * memory limit too much. We might save memory by using more batches, but
-	 * it would result in spilling more batch files, which does not seem like
-	 * a great trade off.
+	 * While growing the hashtable, we also adjust the number of buckets to
+	 * maintain a load factor of NTUP_PER_BUCKET while squeezing tuples back
+	 * from batches into the hashtable.
 	 *
-	 * While growing the hashtable, we also adjust the number of buckets, to
-	 * not have more than one tuple per bucket (load factor 1). We can only do
-	 * this during the initial sizing - once we start building the hash,
-	 * nbucket is fixed.
+	 * Note that we can only change nbuckets during initial hashtable sizing.
+	 * Once we start building the hash, nbuckets is fixed.
 	 */
-	while (nbatch > 0)
+	while (nbatch > 1)
 	{
-		/* how much memory are we using with current nbatch value */
-		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+		/*
+		 * Ensure that nbuckets can be allocated. nbuckets may have been
+		 * increased above hash_table_size / sizeof(HashJoinTuple) if work_mem
+		 * was very small, so checking if MaxAllocSize is enough for the
+		 * buckets is the safest check.
+		 */
+		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
+			break;
 
-		/* how much memory would we use with half the batches */
-		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+		/*
+		 * We check that space_allowed wouldn't overflow because it will be
+		 * larger than hash_table_bytes if there is a skew hashtable.
+		 */
+		if ((*space_allowed) > SIZE_MAX / 2)
+			break;
 
-		/* If the memory usage would not decrease, we found the optimum. */
-		if (current_space < new_space)
+		/*
+		 * This is the same as:
+		 *
+		 * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
+		 * nbatch * BLCKSZ
+		 *
+		 * avoiding intermediate overflow.
+		 *
+		 * which is to say: will halving the number of batches and doubling
+		 * the size of the hashtable reduce overall memory usage?
+		 */
+		if (nbatch < hash_table_bytes / BLCKSZ)
 			break;
 
 		/*
-		 * It's better to use half the batches, so do that and adjust the
-		 * nbucket in the opposite direction, and double the allowance.
+		 * MaxAllocSize is sufficiently small that we are not worried about
+		 * overflowing nbuckets.
 		 */
-		nbatch /= 2;
 		nbuckets *= 2;
-
+		hash_table_bytes *= 2;
 		*space_allowed = (*space_allowed) * 2;
+
+		nbatch /= 2;
 	}
 
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);
-
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
 }
@@ -994,7 +989,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
 	 * How much additional memory would doubling nbatch use? Each batch may
 	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
 	 */
-	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+	size_t		batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
 
 	/*
 	 * Compare the new space needed for doubling nbatch and for enlarging the
-- 
2.43.0

#25Tomas Vondra
tomas@vondra.me
In reply to: Melanie Plageman (#24)
Re: Fix overflow of nbatch

On 10/9/25 16:16, Melanie Plageman wrote:

On Wed, Oct 8, 2025 at 4:46 PM Tomas Vondra <tomas@vondra.me> wrote:

On 10/8/25 19:37, Melanie Plageman wrote:

Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:

hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
sizeof(HashJoinTuple) / 2

But the hash table is not allocated as a single chunk of memory, so I
think MaxAllocSize would be the wrong thing to use here, no? Hash table
is a collection of tuples (allocated one by one), that's why
get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
nbuckets, because that indeed is allocated as a contiguous array, ofc.

Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
we get

hash_table_bytes > MaxAllocSize / 2

but again, that doesn't make much sense - the hash table can be larger,
it's not a single palloc.

It came from the earlier clamping of nbuckets:

max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
dbuckets = Min(dbuckets, max_pointers);

I don't really get why this divides hash_table_bytes by
sizeof(HashJoinTuple) since, as you say, hash_table_bytes is supposed
to accommodate both the bucket_bytes and inner_rel_bytes.

I think that's simply to allocate enough buckets for the expected
number of tuples, not more. And cap it to MaxAllocSize. Some of this may
be redundant, I suppose.

And I think it should be just

if (hash_table_bytes > SIZE_MAX / 2)
break;

as a protection against hash_table_bytes overflowing SIZE_MAX (which on
64-bits seems implausible, but could happen on 32-bit builds).

That's roughly the check I ended up with -- except I used
space_allowed because it will be larger than hash_table_bytes if there
is a skew hashtable. I've updated the comment and such in attached v3.

Ah, thanks. I was just hacking on this too, but I'll switch to your v3.

regards

--
Tomas Vondra

#26Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#23)
Re: Fix overflow of nbatch

On Wed, Oct 8, 2025 at 5:08 PM Tomas Vondra <tomas@vondra.me> wrote:

On 10/8/25 21:16, Melanie Plageman wrote:

On Wed, Oct 8, 2025 at 1:37 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I have updated my patch to fix the mistakes above. I also noticed then
that I wasn't doubling space_allowed in the loop but instead setting
it to hash_table_bytes at the end. This doesn't produce a power of 2
because we subtract skew_mcvs from the hash_table_bytes. So, we have
to keep using space_allowed if we want a power of 2 in the end.

I've changed my patch to do this, but this made me wonder if we want
to be doing this or instead take hash_table_bytes at the end and round
it up to a power of 2 and set space_allowed to that. If the skew
hashtable is large, we may be allocating way more space_allowed than
we need for new hash_table_bytes + skew hashtable buckets.

I don't think there's any promise hash_table_bytes being a power of 2.
You can make hash_table_bytes an almost arbitrary value by setting
work_mem and hash_mem_multiplier. Or am I missing something?

Right. Your example used powers of 2 for work_mem, which is why I saw
power of 2 output for space_allowed, but that is arbitrary.

But you're right hash_table_bytes and space_allowed may not be equal if
useskew=true. So setting space_allowed to hash_table_bytes at the end
does not seem right. I think we don't actually need hash_table_bytes at
this point, we can just ignore it, and use/double *space_allowed.

I kept using hash_table_bytes mostly because it didn't require the
pointer dereferencing, but I failed to consider the useskew=true thing.

We could make some local variable of space_allowed if we want.

However, this means there's probably a bug - the loop should probably
double num_skew_mcvs too. We simply reserve SKEW_HASH_MEM_PERCENT of
space_allowed for skew hashtable, so should we adjust it the same way?

We do need to recalculate num_skew_mcvs once at the end before returning.

But I don't think we need to do it on each iteration of the loop. We
want the total size of what is in memory -- the regular and skew
hashtable -- to be the condition we break on. I think that's
space_allowed.

Oh wait, that doesn't make sense because each batch could have a skew hashtable.

Not sure I understand. Is this the same issue I just described?

Yea, what I meant is that we are increasing the size of the hashtable
and decreasing the number of batches on the assumption that we might
end up with tuples from multiple batches consolidated down to one
batch and one larger hashtable. Each of those batches could have skew
buckets so we now end up needing a larger skew hashtable too, not just
a larger regular hashtable.

- Melanie

#27Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#25)
2 attachment(s)
Re: Fix overflow of nbatch

On 10/9/25 16:30, Tomas Vondra wrote:

On 10/9/25 16:16, Melanie Plageman wrote:

On Wed, Oct 8, 2025 at 4:46 PM Tomas Vondra <tomas@vondra.me> wrote:

On 10/8/25 19:37, Melanie Plageman wrote:

Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:

hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
sizeof(HashJoinTuple) / 2

But the hash table is not allocated as a single chunk of memory, so I
think MaxAllocSize would be the wrong thing to use here, no? Hash table
is a collection of tuples (allocated one by one), that's why
get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
nbuckets, because that indeed is allocated as a contiguous array, ofc.

Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
we get

hash_table_bytes > MaxAllocSize / 2

but again, that doesn't make much sense - the hash table can be larger,
it's not a single palloc.

It came from the earlier clamping of nbuckets:

max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
dbuckets = Min(dbuckets, max_pointers);

I don't really get why this divides hash_table_bytes by
sizeof(HashJoinTuple) since, as you say, hash_table_bytes is supposed
to accommodate both the bucket_bytes and inner_rel_bytes.

I think that's simply to allocate enough buckets for the expected
number of tuples, not more. And cap it to MaxAllocSize. Some of this may
be redundant, I suppose.

And I think it should be just

if (hash_table_bytes > SIZE_MAX / 2)
break;

as a protection against hash_table_bytes overflowing SIZE_MAX (which on
64-bits seems implausible, but could happen on 32-bit builds).

That's roughly the check I ended up with -- except I used
space_allowed because it will be larger than hash_table_bytes if there
is a skew hashtable. I've updated the comment and such in attached v3.

Ah, thanks. I was just hacking on this too, but I'll switch to your v3.

Here's a v4, which is your v3 with a couple minor adjustments:

1) A couple comments adjusted. It feels a bit too audacious to correct
comments written by a native speaker, but it seems cleaner to me like this.

2) I think the (nbatch < hash_table_bytes / BLCKSZ) condition really
needs to check space_allowed, because that's the whole space, before
subtracting the skew buckets. And we also check the overflow for
space_allowed earlier, not hash_table_bytes.

3) removes the hash_table_bytes doubling, because it's not needed by the
loop anymore (or after that)

4) added the doubling of num_skew_mcvs, and also the overflow protection
for that

You suggested this in another message:

We do need to recalculate num_skew_mcvs once at the end before
returning.

But I think the doubling has the same effect, right? I don't want to
redo the whole "if (useskew) { ... }" block at the end.

I wonder if maybe it'd be better to just merge all the overflow checks
into a single "if (a || b || c)" check. These separate checks seem quite
verbose. I'll see tomorrow.

I've done a bit more testing today. I went a bit overboard and created a
table with 20B rows, which actually pushes nbatch high enough to hit the
initial issue (overflow in the size calculation). And it works sanely
with the v4 patch (and v3 too).

I guess I could have tweaked the table stats, or just manipulate the
values at the beginning of the function. But that wouldn't be so fun.

regards

--
Tomas Vondra

Attachments:

v4-0002-adjustments.patch (text/x-patch)
From d3c67d91f68cadec7c395da933d6e56ba600b94d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Fri, 10 Oct 2025 00:38:05 +0200
Subject: [PATCH v4 2/2] adjustments

---
 src/backend/executor/nodeHash.c | 43 ++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 0f815674c9f..264c5b7ed40 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -883,37 +883,51 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 *
 	 * Note that we can only change nbuckets during initial hashtable sizing.
 	 * Once we start building the hash, nbuckets is fixed.
+	 *
+	 * We will double a number of parameters (space_allowed, nbuckets and
+	 * num_skew_mcvs), which brings a risk of overflow. We simply stop the
+	 * loop if that happens. We could do something smarter (e.g. cap nbuckets
+	 * and continue), but the complexity is not worth it. It's extremely rare
+	 * and this is a best-effort attempt to reduce memory usage.
 	 */
 	while (nbatch > 1)
 	{
 		/*
-		 * Ensure that nbuckets can be allocated. nbuckets may have been
-		 * increased above hash_table_size / sizeof(HashJoinTuple) if work_mem
-		 * was very small, so checking if MaxAllocSize is enough for the
-		 * buckets is the safest check.
+		 * Check that nbuckets won't overflow MaxAllocSize.
+		 *
+		 * With extremely low work_mem values, nbuckets may have been set
+		 * higher than hash_table_size / sizeof(HashJoinTuple). We don't try
+		 * to correct that here, we accept nbuckets to be oversized.
 		 */
 		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
 			break;
 
 		/*
-		 * We check that space_allowed wouldn't overflow because it will be
-		 * larger than hash_table_bytes if there is a skew hashtable.
+		 * Check that space_allowed won't overflow.
+		 *
+		 * We don't check hash_table_bytes, because we subtracted space for
+		 * skew hashtable from it (so it may not be equal to space_allowed).
 		 */
 		if ((*space_allowed) > SIZE_MAX / 2)
 			break;
 
 		/*
-		 * This is the same as:
+		 * Check that num_skew_mcvs won't overflow.
+		 */
+		if ((*num_skew_mcvs) > INT_MAX / 2)
+			break;
+
+		/*
+		 * Will halving the number of batches and doubling the size of the
+		 * hashtable reduce overall memory usage?
 		 *
-		 * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
-		 * nbatch * BLCKSZ
+		 * This is the same as (S = space_allowed):
 		 *
-		 * avoiding intermediate overflow.
+		 * (S + 2 * nbatch * BLCKSZ) < (S * 2 + nbatch * BLCKSZ)
 		 *
-		 * which is to say: will halving the number of batches and doubling
-		 * the size of the hashtable reduce overall memory usage?
+		 * but avoiding intermediate overflow.
 		 */
-		if (nbatch < hash_table_bytes / BLCKSZ)
+		if (nbatch < *space_allowed / BLCKSZ)
 			break;
 
 		/*
@@ -921,7 +935,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		 * overflowing nbuckets.
 		 */
 		nbuckets *= 2;
-		hash_table_bytes *= 2;
+
+		*num_skew_mcvs = (*num_skew_mcvs) * 2;
 		*space_allowed = (*space_allowed) * 2;
 
 		nbatch /= 2;
-- 
2.51.0

v4-0001-Fix-overflow-risk-in-hashtable-size-calculation.patch (text/x-patch)
From aee3da1fe907ba837a45825a158ab2e2507d3ed2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 7 Oct 2025 13:21:14 -0400
Subject: [PATCH v4 1/2] Fix overflow risk in hashtable size calculation

ci-os-only:
---
 src/backend/executor/nodeHash.c | 115 +++++++++++++++-----------------
 1 file changed, 55 insertions(+), 60 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..0f815674c9f 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -850,90 +850,85 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Optimize the total amount of memory consumed by the hash node.
 	 *
-	 * The nbatch calculation above focuses on the size of the in-memory hash
-	 * table, assuming no per-batch overhead. Now adjust the number of batches
-	 * and the size of the hash table to minimize total memory consumed by the
-	 * hash node.
-	 *
-	 * Each batch file has a BLCKSZ buffer, and we may need two files per
-	 * batch (inner and outer side). So with enough batches this can be
-	 * significantly more memory than the hashtable itself.
+	 * The nbatch calculation above focuses on the in-memory hash table,
+	 * assuming no per-batch overhead. But each batch may have two files, each
+	 * with a BLCKSZ buffer. With enough batches these buffers may use
+	 * significantly more memory than the hash table.
 	 *
 	 * The total memory usage may be expressed by this formula:
 	 *
-	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
 	 *
 	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
 	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
-	 * buffers. But for sufficiently large values of inner_rel_bytes value
-	 * there may not be a nbatch value that would make both parts fit into
-	 * hash_table_bytes.
-	 *
-	 * In this case we can't enforce the memory limit - we're going to exceed
-	 * it. We can however minimize the impact and use as little memory as
-	 * possible. (We haven't really enforced it before either, as we simply
-	 * ignored the batch files.)
-	 *
-	 * The formula for total memory usage says that given an inner relation of
-	 * size inner_rel_bytes, we may divide it into an arbitrary number of
-	 * batches. This determines both the size of the in-memory hash table and
-	 * the amount of memory needed for batch files. These two terms work in
-	 * opposite ways - when one decreases, the other increases.
+	 * buffers.
 	 *
-	 * For low nbatch values, the hash table takes most of the memory, but at
-	 * some point the batch files start to dominate. If you combine these two
-	 * terms, the memory consumption (for a fixed size of the inner relation)
-	 * has a u-shape, with a minimum at some nbatch value.
+	 * For very large inner_rel_bytes, there may be no nbatch that keeps total
+	 * memory usage under the budget (work_mem * hash_mem_multiplier). In that
+	 * case, we choose nbatch to minimize total memory consumption across both
+	 * the hashtable and file buffers.
 	 *
-	 * Our goal is to find this nbatch value, minimizing the memory usage. We
-	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
-	 * if it's lower than the current memory usage we know it's better to use
-	 * fewer batches. We repeat this until reducing the number of batches does
-	 * not reduce the memory usage - we found the optimum. We know the optimum
-	 * exists, thanks to the u-shape.
+	 * As we increase the size of the hashtable and decrease the number of
+	 * batches, the total memory usage follows a U-shaped curve. We find the
+	 * minimum nbatch by "walking back" -- checking if halving nbatch would
+	 * lower total memory usage. We stop when it no longer helps.
 	 *
-	 * We only want to do this when exceeding the memory limit, not every
-	 * time. The goal is not to minimize memory usage in every case, but to
-	 * minimize the memory usage when we can't stay within the memory limit.
+	 * We only want to do this when we are already over the memory budget. We
+	 * don't explore increasing nbatch since that could force batching when it
+	 * is not needed.
 	 *
-	 * For this reason we only consider reducing the number of batches. We
-	 * could try the opposite direction too, but that would save memory only
-	 * when most of the memory is used by the hash table. And the hash table
-	 * was used for the initial sizing, so we shouldn't be exceeding the
-	 * memory limit too much. We might save memory by using more batches, but
-	 * it would result in spilling more batch files, which does not seem like
-	 * a great trade off.
+	 * While growing the hashtable, we also adjust the number of buckets to
+	 * maintain a load factor of NTUP_PER_BUCKET while squeezing tuples back
+	 * from batches into the hashtable.
 	 *
-	 * While growing the hashtable, we also adjust the number of buckets, to
-	 * not have more than one tuple per bucket (load factor 1). We can only do
-	 * this during the initial sizing - once we start building the hash,
-	 * nbucket is fixed.
+	 * Note that we can only change nbuckets during initial hashtable sizing.
+	 * Once we start building the hash, nbuckets is fixed.
 	 */
-	while (nbatch > 0)
+	while (nbatch > 1)
 	{
-		/* how much memory are we using with current nbatch value */
-		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+		/*
+		 * Ensure that nbuckets can be allocated. nbuckets may have been
+		 * increased above hash_table_size / sizeof(HashJoinTuple) if work_mem
+		 * was very small, so checking if MaxAllocSize is enough for the
+		 * buckets is the safest check.
+		 */
+		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
+			break;
 
-		/* how much memory would we use with half the batches */
-		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+		/*
+		 * We check that space_allowed wouldn't overflow because it will be
+		 * larger than hash_table_bytes if there is a skew hashtable.
+		 */
+		if ((*space_allowed) > SIZE_MAX / 2)
+			break;
 
-		/* If the memory usage would not decrease, we found the optimum. */
-		if (current_space < new_space)
+		/*
+		 * This is the same as:
+		 *
+		 * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
+		 * nbatch * BLCKSZ
+		 *
+		 * avoiding intermediate overflow.
+		 *
+		 * which is to say: will halving the number of batches and doubling
+		 * the size of the hashtable reduce overall memory usage?
+		 */
+		if (nbatch < hash_table_bytes / BLCKSZ)
 			break;
 
 		/*
-		 * It's better to use half the batches, so do that and adjust the
-		 * nbucket in the opposite direction, and double the allowance.
+		 * MaxAllocSize is sufficiently small that we are not worried about
+		 * overflowing nbuckets.
 		 */
-		nbatch /= 2;
 		nbuckets *= 2;
-
+		hash_table_bytes *= 2;
 		*space_allowed = (*space_allowed) * 2;
+
+		nbatch /= 2;
 	}
 
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);
-
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
 }
@@ -994,7 +989,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
 	 * How much additional memory would doubling nbatch use? Each batch may
 	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
 	 */
-	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+	size_t		batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
 
 	/*
 	 * Compare the new space needed for doubling nbatch and for enlarging the
-- 
2.51.0

#28Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#27)
3 attachment(s)
Re: Fix overflow of nbatch

On Thu, Oct 9, 2025 at 7:36 PM Tomas Vondra <tomas@vondra.me> wrote:

1) A couple comments adjusted. It feels a bit too audacious to correct
comments written by a native speaker, but it seems cleaner to me like this.

I attached a patch with a few more suggested adjustments (0003). The
more substantive tweaks are:

I don't really like that this comment says it is about nbuckets
overflowing MaxAllocSize: overflow means something specific, and this
sounds like we are saying the nbuckets variable itself will overflow
MaxAllocSize, when what we mean is that nbuckets worth of HashJoinTuples
could exceed MaxAllocSize. You don't have to use my wording, but I'm
not sure about this wording either.

/* Check that nbuckets won't overflow MaxAllocSize */
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

Also, upon re-reading the comment we wrote together:

* With extremely low work_mem values, nbuckets may have been set
* higher than hash_table_bytes / sizeof(HashJoinTuple). We don't try
* to correct that here, we accept nbuckets to be oversized.

I'm not so sure it belongs above the nbuckets allocation check.
Perhaps we should move it to where we are doubling nbuckets. Or even
above where we clamp it to 1024.

I'm actually wondering if we want this comment at all. Does it
emphasize an edge case to a confusing point? I'm imagining myself
studying it in the future and having no idea what it means.

I've kept it in but moved it in 0003.

4) added the doubling of num_skew_mcvs, and also the overflow protection
for that

Do we really need to check if num_skew_mcvs will overflow? Shouldn't
it always be smaller than nbuckets? Maybe it can be an assert.

You suggested this in another message:

We do need to recalculate num_skew_mcvs once at the end before
returning.

But I think the doubling has the same effect, right? I don't want to
redo the whole "if (useskew) { ... }" block at the end.

Yea, it would have to be some kind of helper or something. I worried
just doubling num_skew_mcvs would drift significantly because of
integer truncation -- perhaps even a meaningful amount. But it was
just an intuition -- I didn't plug in any numbers and try.

- Melanie

Attachments:

v5-0001-Fix-overflow-risk-in-hashtable-size-calculation.patch (text/x-patch)
From 516de50b0fe2075955dab2168a7e46f0f9222278 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 7 Oct 2025 13:21:14 -0400
Subject: [PATCH v5 1/3] Fix overflow risk in hashtable size calculation

ci-os-only:
---
 src/backend/executor/nodeHash.c | 115 +++++++++++++++-----------------
 1 file changed, 55 insertions(+), 60 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..0f815674c9f 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -850,90 +850,85 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Optimize the total amount of memory consumed by the hash node.
 	 *
-	 * The nbatch calculation above focuses on the size of the in-memory hash
-	 * table, assuming no per-batch overhead. Now adjust the number of batches
-	 * and the size of the hash table to minimize total memory consumed by the
-	 * hash node.
-	 *
-	 * Each batch file has a BLCKSZ buffer, and we may need two files per
-	 * batch (inner and outer side). So with enough batches this can be
-	 * significantly more memory than the hashtable itself.
+	 * The nbatch calculation above focuses on the in-memory hash table,
+	 * assuming no per-batch overhead. But each batch may have two files, each
+	 * with a BLCKSZ buffer. With enough batches these buffers may use
+	 * significantly more memory than the hash table.
 	 *
 	 * The total memory usage may be expressed by this formula:
 	 *
-	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
 	 *
 	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
 	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
-	 * buffers. But for sufficiently large values of inner_rel_bytes value
-	 * there may not be a nbatch value that would make both parts fit into
-	 * hash_table_bytes.
-	 *
-	 * In this case we can't enforce the memory limit - we're going to exceed
-	 * it. We can however minimize the impact and use as little memory as
-	 * possible. (We haven't really enforced it before either, as we simply
-	 * ignored the batch files.)
-	 *
-	 * The formula for total memory usage says that given an inner relation of
-	 * size inner_rel_bytes, we may divide it into an arbitrary number of
-	 * batches. This determines both the size of the in-memory hash table and
-	 * the amount of memory needed for batch files. These two terms work in
-	 * opposite ways - when one decreases, the other increases.
+	 * buffers.
 	 *
-	 * For low nbatch values, the hash table takes most of the memory, but at
-	 * some point the batch files start to dominate. If you combine these two
-	 * terms, the memory consumption (for a fixed size of the inner relation)
-	 * has a u-shape, with a minimum at some nbatch value.
+	 * For very large inner_rel_bytes, there may be no nbatch that keeps total
+	 * memory usage under the budget (work_mem * hash_mem_multiplier). In that
+	 * case, we choose nbatch to minimize total memory consumption across both
+	 * the hashtable and file buffers.
 	 *
-	 * Our goal is to find this nbatch value, minimizing the memory usage. We
-	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
-	 * if it's lower than the current memory usage we know it's better to use
-	 * fewer batches. We repeat this until reducing the number of batches does
-	 * not reduce the memory usage - we found the optimum. We know the optimum
-	 * exists, thanks to the u-shape.
+	 * As we increase the size of the hashtable and decrease the number of
+	 * batches, the total memory usage follows a U-shaped curve. We find the
+	 * minimum nbatch by "walking back" -- checking if halving nbatch would
+	 * lower total memory usage. We stop when it no longer helps.
 	 *
-	 * We only want to do this when exceeding the memory limit, not every
-	 * time. The goal is not to minimize memory usage in every case, but to
-	 * minimize the memory usage when we can't stay within the memory limit.
+	 * We only want to do this when we are already over the memory budget. We
+	 * don't explore increasing nbatch since that could force batching when it
+	 * is not needed.
 	 *
-	 * For this reason we only consider reducing the number of batches. We
-	 * could try the opposite direction too, but that would save memory only
-	 * when most of the memory is used by the hash table. And the hash table
-	 * was used for the initial sizing, so we shouldn't be exceeding the
-	 * memory limit too much. We might save memory by using more batches, but
-	 * it would result in spilling more batch files, which does not seem like
-	 * a great trade off.
+	 * While growing the hashtable, we also adjust the number of buckets to
+	 * maintain a load factor of NTUP_PER_BUCKET while squeezing tuples back
+	 * from batches into the hashtable.
 	 *
-	 * While growing the hashtable, we also adjust the number of buckets, to
-	 * not have more than one tuple per bucket (load factor 1). We can only do
-	 * this during the initial sizing - once we start building the hash,
-	 * nbucket is fixed.
+	 * Note that we can only change nbuckets during initial hashtable sizing.
+	 * Once we start building the hash, nbuckets is fixed.
 	 */
-	while (nbatch > 0)
+	while (nbatch > 1)
 	{
-		/* how much memory are we using with current nbatch value */
-		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+		/*
+		 * Ensure that nbuckets can be allocated. nbuckets may have been
+		 * increased above hash_table_size / sizeof(HashJoinTuple) if work_mem
+		 * was very small, so checking if MaxAllocSize is enough for the
+		 * buckets is the safest check.
+		 */
+		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
+			break;
 
-		/* how much memory would we use with half the batches */
-		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+		/*
+		 * We check that space_allowed wouldn't overflow because it will be
+		 * larger than hash_table_bytes if there is a skew hashtable.
+		 */
+		if ((*space_allowed) > SIZE_MAX / 2)
+			break;
 
-		/* If the memory usage would not decrease, we found the optimum. */
-		if (current_space < new_space)
+		/*
+		 * This is the same as:
+		 *
+		 * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
+		 * nbatch * BLCKSZ
+		 *
+		 * avoiding intermediate overflow.
+		 *
+		 * which is to say: will halving the number of batches and doubling
+		 * the size of the hashtable reduce overall memory usage?
+		 */
+		if (nbatch < hash_table_bytes / BLCKSZ)
 			break;
 
 		/*
-		 * It's better to use half the batches, so do that and adjust the
-		 * nbucket in the opposite direction, and double the allowance.
+		 * MaxAllocSize is sufficiently small that we are not worried about
+		 * overflowing nbuckets.
 		 */
-		nbatch /= 2;
 		nbuckets *= 2;
-
+		hash_table_bytes *= 2;
 		*space_allowed = (*space_allowed) * 2;
+
+		nbatch /= 2;
 	}
 
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);
-
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
 }
@@ -994,7 +989,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
 	 * How much additional memory would doubling nbatch use? Each batch may
 	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
 	 */
-	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+	size_t		batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
 
 	/*
 	 * Compare the new space needed for doubling nbatch and for enlarging the
-- 
2.43.0

v5-0002-adjustments.patch (text/x-patch)
From 27544e1504a0f014165e1120e5230e22a424b8b4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Fri, 10 Oct 2025 00:38:05 +0200
Subject: [PATCH v5 2/3] adjustments

---
 src/backend/executor/nodeHash.c | 43 ++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 0f815674c9f..264c5b7ed40 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -883,37 +883,51 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 *
 	 * Note that we can only change nbuckets during initial hashtable sizing.
 	 * Once we start building the hash, nbuckets is fixed.
+	 *
+	 * We will double a number of parameters (space_allowed, nbuckets and
+	 * num_skew_mcvs), which brings the overflow risk. We simply stop the loop
+	 * if that happens. We could do something smarter (e.g. cap nbuckets and
+	 * continue), but the complexity is not worth it. It's extremely rare and
+	 * this is a best-effort attempt to reduce memory usage.
 	 */
 	while (nbatch > 1)
 	{
 		/*
-		 * Ensure that nbuckets can be allocated. nbuckets may have been
-		 * increased above hash_table_size / sizeof(HashJoinTuple) if work_mem
-		 * was very small, so checking if MaxAllocSize is enough for the
-		 * buckets is the safest check.
+		 * Check that nbuckets won't overflow MaxAllocSize.
+		 *
+		 * With extremely low work_mem values, nbuckets may have been set
+		 * higher than hash_table_size / sizeof(HashJoinTuple). We don't try
+		 * to correct that here, we accept nbuckets to be oversized.
 		 */
 		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
 			break;
 
 		/*
-		 * We check that space_allowed wouldn't overflow because it will be
-		 * larger than hash_table_bytes if there is a skew hashtable.
+		 * Check that space_allowed won't overflow.
+		 *
+		 * We don't check hash_table_bytes, because we subtracted space for
+		 * skew hashtable from it (so it may not be equal to space_allowed).
 		 */
 		if ((*space_allowed) > SIZE_MAX / 2)
 			break;
 
 		/*
-		 * This is the same as:
+		 * Check that num_skew_mcvs won't overflow.
+		 */
+		if ((*num_skew_mcvs) > INT_MAX / 2)
+			break;
+
+		/*
+		 * Will halving the number of batches and doubling the size of the
+		 * hashtable reduce overall memory usage?
 		 *
-		 * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
-		 * nbatch * BLCKSZ
+		 * This is the same as (S = space_allowed):
 		 *
-		 * avoiding intermediate overflow.
+		 * (S + 2 * nbatch * BLCKSZ) < (S * 2 + nbatch * BLCKSZ)
 		 *
-		 * which is to say: will halving the number of batches and doubling
-		 * the size of the hashtable reduce overall memory usage?
+		 * but avoiding intermediate overflow.
 		 */
-		if (nbatch < hash_table_bytes / BLCKSZ)
+		if (nbatch < *space_allowed / BLCKSZ)
 			break;
 
 		/*
@@ -921,7 +935,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		 * overflowing nbuckets.
 		 */
 		nbuckets *= 2;
-		hash_table_bytes *= 2;
+
+		*num_skew_mcvs = (*num_skew_mcvs) * 2;
 		*space_allowed = (*space_allowed) * 2;
 
 		nbatch /= 2;
-- 
2.43.0

v5-0003-more-tweaks.patch (text/x-patch)
From 882905cabe42d061aad6b895602710beb5d81297 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 13 Oct 2025 12:03:54 -0400
Subject: [PATCH v5 3/3] more tweaks

---
 src/backend/executor/nodeHash.c | 39 ++++++++++++++-------------------
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 264c5b7ed40..f562077df5e 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -884,37 +884,27 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * Note that we can only change nbuckets during initial hashtable sizing.
 	 * Once we start building the hash, nbuckets is fixed.
 	 *
-	 * We will double a number of parameters (space_allowed, nbuckets and
-	 * num_skew_mcvs), which brings the overflow risk. We simply stop the loop
-	 * if that happens. We could do something smarter (e.g. cap nbuckets and
-	 * continue), but the complexity is not worth it. It's extremely rare and
-	 * this is a best-effort attempt to reduce memory usage.
+	 * We double several parameters (space_allowed, nbuckets, and
+	 * num_skew_mcvs), which introduces a risk of overflow. We avoid this by
+	 * exiting the loop. We could do something smarter (e.g. capping nbuckets
+	 * and continue), but the complexity is not worth it. Such cases are
+	 * extremely rare, and this is a best-effort attempt to reduce memory
+	 * usage.
 	 */
 	while (nbatch > 1)
 	{
-		/*
-		 * Check that nbuckets won't overflow MaxAllocSize.
-		 *
-		 * With extremely low work_mem values, nbuckets may have been set
-		 * higher than hash_table_size / sizeof(HashJoinTuple). We don't try
-		 * to correct that here, we accept nbuckets to be oversized.
-		 */
+		/* Check that nbuckets won't overflow MaxAllocSize */
 		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
 			break;
 
-		/*
-		 * Check that space_allowed won't overflow.
-		 *
-		 * We don't check hash_table_bytes, because we subtracted space for
-		 * skew hashtable from it (so it may not be equal to space_allowed).
-		 */
-		if ((*space_allowed) > SIZE_MAX / 2)
-			break;
+		/* num_skew_mcvs should be less than nbuckets */
+		Assert((*num_skew_mcvs) < INT_MAX / 2);
 
 		/*
-		 * Check that num_skew_mcvs won't overflow.
+		 * space_allowed will exceed hash_table_bytes when there is a
+		 * skew_hashtable. Just exit if doubling it will overflow.
 		 */
-		if ((*num_skew_mcvs) > INT_MAX / 2)
+		if ((*space_allowed) > SIZE_MAX / 2)
 			break;
 
 		/*
@@ -933,6 +923,10 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		/*
 		 * MaxAllocSize is sufficiently small that we are not worried about
 		 * overflowing nbuckets.
+		 *
+		 * With extremely low work_mem values, nbuckets may have been set
+		 * higher than hash_table_bytes / sizeof(HashJoinTuple). We don't try
+		 * to correct that here.
 		 */
 		nbuckets *= 2;
 
@@ -944,6 +938,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);
+
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
 }
-- 
2.43.0

#29Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#28)
Re: Fix overflow of nbatch

On Mon, Oct 13, 2025 at 12:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Thu, Oct 9, 2025 at 7:36 PM Tomas Vondra <tomas@vondra.me> wrote:

1) A couple comments adjusted. It feels a bit too audacious to correct
comments written by a native speaker, but it seems cleaner to me like this.

I attached a patch with a few more suggested adjustments (0003)

Oh, and I didn't add this, but all of the other pointer dereferences
have unnecessary (in terms of operator precedence) parentheses around
them, so for consistency I would put them around this one (or remove
them everywhere, since they are not needed):

if (nbatch < *space_allowed / BLCKSZ)
break;

- Melanie

#30Tomas Vondra
tomas@vondra.me
In reply to: Melanie Plageman (#28)
1 attachment(s)
Re: Fix overflow of nbatch

On 10/13/25 18:05, Melanie Plageman wrote:

On Thu, Oct 9, 2025 at 7:36 PM Tomas Vondra <tomas@vondra.me> wrote:

1) A couple comments adjusted. It feels a bit too audacious to correct
comments written by a native speaker, but it seems cleaner to me like this.

I attached a patch with a few more suggested adjustments (0003). The
more substantive tweaks are:

I don't really like that this comment says it is about nbuckets
overflowing MaxAllocSize: overflow means something specific, and this
sounds like we are saying the nbuckets variable itself will overflow
MaxAllocSize, when what we mean is that nbuckets worth of HashJoinTuples
could exceed MaxAllocSize. You don't have to use my wording, but I'm
not sure about this wording either.

/* Check that nbuckets won't overflow MaxAllocSize */
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

I think the wording is fine, it just needs to talk about "buckets" and
not "nbuckets". I did that in the attached v6.

Also, upon re-reading the comment we wrote together:

* With extremely low work_mem values, nbuckets may have been set
* higher than hash_table_bytes / sizeof(HashJoinTuple). We don't try
* to correct that here, we accept nbuckets to be oversized.

I'm not so sure it belongs above the nbuckets allocation check.
Perhaps we should move it to where we are doubling nbuckets. Or even
above where we clamp it to 1024.

Yeah, I'm not happy with the place either. But mentioning this above the
clamp to 1024 doesn't seem great either.

I'm actually wondering if we want this comment at all. Does it
emphasize an edge case to a confusing point? I'm imagining myself
studying it in the future and having no idea what it means.

You may be right. I thought someone might read the code in the future,
possibly while investigating a case where the loop stopped too early, and
be puzzled, not realizing nbuckets might be too high. But frankly,
that's super unlikely. It only applies to cases with extremely low
work_mem values, that's quite unlikely on machines doing massive joins.

I'll think about it in the morning, but I'll probably remove it.

I've kept it in but moved it in 0003.

4) added the doubling of num_skew_mcvs, and also the overflow protection
for that

Do we really need to check if num_skew_mcvs will overflow? Shouldn't
it always be smaller than nbuckets? Maybe it can be an assert.

Good point. I wasn't sure if that's guaranteed, but after looking at the
skew_mcvs calculation again I think you're right. So +1 to assert.

You suggested this in another message:

We do need to recalculate num_skew_mcvs once at the end before
returning.

But I think the doubling has the same effect, right? I don't want to
redo the whole "if (useskew) { ... }" block at the end.

Yea, it would have to be some kind of helper or something. I worried
just doubling num_skew_mcvs would drift significantly because of
integer truncation -- perhaps even a meaningful amount. But it was
just an intuition -- I didn't plug in any numbers and try.

I think the integer truncation should not matter. AFAIK it could be off
by 1 on the first loop (due to rounding), and then the error gets
doubled on every loop. So with 8 loops we might be off by 127, right?
But with the 2% SKEW_HASH_MEM_PERCENT that difference is negligible
compared to the actual skew_mcvs value, I think.

All these formulas are rough guesses based on arbitrary constants (like
SKEW_HASH_MEM_PERCENT) anyway.

I'll give this a bit more testing and review tomorrow, and then I'll
push. I don't want to hold this back through pgconf.eu.

regards

--
Tomas Vondra

Attachments:

v6-0001-Fix-overflow-risk-in-hashtable-size-calculation.patch (text/x-patch)
From 5b0eabb40f12435d08efa2634925530ef0a8ba88 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 7 Oct 2025 13:21:14 -0400
Subject: [PATCH v6] Fix overflow risk in hashtable size calculation

ci-os-only:
---
 src/backend/executor/nodeHash.c | 125 +++++++++++++++++---------------
 1 file changed, 66 insertions(+), 59 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..3fe1fa4831c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -850,85 +850,92 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Optimize the total amount of memory consumed by the hash node.
 	 *
-	 * The nbatch calculation above focuses on the size of the in-memory hash
-	 * table, assuming no per-batch overhead. Now adjust the number of batches
-	 * and the size of the hash table to minimize total memory consumed by the
-	 * hash node.
-	 *
-	 * Each batch file has a BLCKSZ buffer, and we may need two files per
-	 * batch (inner and outer side). So with enough batches this can be
-	 * significantly more memory than the hashtable itself.
+	 * The nbatch calculation above focuses on the in-memory hash table,
+	 * assuming no per-batch overhead. But each batch may have two files, each
+	 * with a BLCKSZ buffer. With enough batches these buffers may use
+	 * significantly more memory than the hash table.
 	 *
 	 * The total memory usage may be expressed by this formula:
 	 *
-	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
 	 *
 	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
 	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
-	 * buffers. But for sufficiently large values of inner_rel_bytes value
-	 * there may not be a nbatch value that would make both parts fit into
-	 * hash_table_bytes.
-	 *
-	 * In this case we can't enforce the memory limit - we're going to exceed
-	 * it. We can however minimize the impact and use as little memory as
-	 * possible. (We haven't really enforced it before either, as we simply
-	 * ignored the batch files.)
+	 * buffers.
 	 *
-	 * The formula for total memory usage says that given an inner relation of
-	 * size inner_rel_bytes, we may divide it into an arbitrary number of
-	 * batches. This determines both the size of the in-memory hash table and
-	 * the amount of memory needed for batch files. These two terms work in
-	 * opposite ways - when one decreases, the other increases.
+	 * For very large inner_rel_bytes, there may be no nbatch that keeps total
+	 * memory usage under the budget (work_mem * hash_mem_multiplier). In that
+	 * case, we choose nbatch to minimize total memory consumption across both
+	 * the hashtable and file buffers.
 	 *
-	 * For low nbatch values, the hash table takes most of the memory, but at
-	 * some point the batch files start to dominate. If you combine these two
-	 * terms, the memory consumption (for a fixed size of the inner relation)
-	 * has a u-shape, with a minimum at some nbatch value.
+	 * As we increase the size of the hashtable and decrease the number of
+	 * batches, the total memory usage follows a U-shaped curve. We find the
+	 * minimum nbatch by "walking back" -- checking if halving nbatch would
+	 * lower total memory usage. We stop when it no longer helps.
 	 *
-	 * Our goal is to find this nbatch value, minimizing the memory usage. We
-	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
-	 * if it's lower than the current memory usage we know it's better to use
-	 * fewer batches. We repeat this until reducing the number of batches does
-	 * not reduce the memory usage - we found the optimum. We know the optimum
-	 * exists, thanks to the u-shape.
+	 * We only want to do this when we are already over the memory budget. We
+	 * don't explore increasing nbatch since that could force batching when it
+	 * is not needed.
 	 *
-	 * We only want to do this when exceeding the memory limit, not every
-	 * time. The goal is not to minimize memory usage in every case, but to
-	 * minimize the memory usage when we can't stay within the memory limit.
+	 * While growing the hashtable, we also adjust the number of buckets to
+	 * maintain a load factor of NTUP_PER_BUCKET while squeezing tuples back
+	 * from batches into the hashtable.
 	 *
-	 * For this reason we only consider reducing the number of batches. We
-	 * could try the opposite direction too, but that would save memory only
-	 * when most of the memory is used by the hash table. And the hash table
-	 * was used for the initial sizing, so we shouldn't be exceeding the
-	 * memory limit too much. We might save memory by using more batches, but
-	 * it would result in spilling more batch files, which does not seem like
-	 * a great trade off.
+	 * Note that we can only change nbuckets during initial hashtable sizing.
+	 * Once we start building the hash, nbuckets is fixed.
 	 *
-	 * While growing the hashtable, we also adjust the number of buckets, to
-	 * not have more than one tuple per bucket (load factor 1). We can only do
-	 * this during the initial sizing - once we start building the hash,
-	 * nbucket is fixed.
-	 */
-	while (nbatch > 0)
+	 * We double several parameters (space_allowed, nbuckets, and
+	 * num_skew_mcvs), which introduces a risk of overflow. We avoid this by
+	 * exiting the loop. We could do something smarter (e.g. capping nbuckets
+	 * and continue), but the complexity is not worth it. Such cases are
+	 * extremely rare, and this is a best-effort attempt to reduce memory
+	 * usage.
+	 */
+	while (nbatch > 1)
 	{
-		/* how much memory are we using with current nbatch value */
-		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+		/* Check that buckets won't overflow MaxAllocSize */
+		if (nbuckets > (MaxAllocSize / sizeof(HashJoinTuple) / 2))
+			break;
+
+		/* num_skew_mcvs should be less than nbuckets */
+		Assert((*num_skew_mcvs) < (INT_MAX / 2));
 
-		/* how much memory would we use with half the batches */
-		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+		/*
+		 * Check that space_allowed won't overflow SIZE_MAX.
+		 *
+		 * We don't use hash_table_bytes here, because it does not include the
+		 * skew buckets, and we want to limit the overall memory usage.
+		 */
+		if ((*space_allowed) > (SIZE_MAX / 2))
+			break;
 
-		/* If the memory usage would not decrease, we found the optimum. */
-		if (current_space < new_space)
+		/*
+		 * Will halving the number of batches and doubling the size of the
+		 * hashtable reduce overall memory usage?
+		 *
+		 * This is the same as (S = space_allowed):
+		 *
+		 * (S + 2 * nbatch * BLCKSZ) < (S * 2 + nbatch * BLCKSZ)
+		 *
+		 * but avoiding intermediate overflow.
+		 */
+		if (nbatch < (*space_allowed) / BLCKSZ)
 			break;
 
 		/*
-		 * It's better to use half the batches, so do that and adjust the
-		 * nbucket in the opposite direction, and double the allowance.
+		 * MaxAllocSize is sufficiently small that we are not worried about
+		 * overflowing nbuckets.
+		 *
+		 * With extremely low work_mem values, nbuckets may have been set to
+		 * 1024, and end up oversized here. We don't try to correct that,
+		 * to keep the code simple. It may terminate the loop earlier.
 		 */
-		nbatch /= 2;
 		nbuckets *= 2;
 
+		*num_skew_mcvs = (*num_skew_mcvs) * 2;
 		*space_allowed = (*space_allowed) * 2;
+
+		nbatch /= 2;
 	}
 
 	Assert(nbuckets > 0);
@@ -994,14 +1001,14 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
 	 * How much additional memory would doubling nbatch use? Each batch may
 	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
 	 */
-	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+	size_t		batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
 
 	/*
 	 * Compare the new space needed for doubling nbatch and for enlarging the
 	 * in-memory hash table. If doubling the hash table needs less memory,
 	 * just do that. Otherwise, continue with doubling the nbatch.
 	 *
-	 * We're either doubling spaceAllowed of batchSpace, so which of those
+	 * We're either doubling spaceAllowed or batchSpace, so which of those
 	 * increases the memory usage the least is the same as comparing the
 	 * values directly.
 	 */
-- 
2.51.0

#31Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#30)
Re: Fix overflow of nbatch

On 10/14/25 23:13, Tomas Vondra wrote:

...

I'll give this a bit more testing and review tomorrow, and then I'll
push. I don't want to hold this back through pgconf.eu.

Pushed and backpatched, after some minor tweaks. Thanks for the reviews
and feedback. I consider this fixed now.

One remaining tweak I've been experimenting with (for master) is fixing
the weird behavior I described at the end of [1]. The root cause is that
we cap nbuckets by max_pointers, which is the max number of pointers we
can fit into work_mem. The consequence is that increasing work_mem also
increases nbatch too, which is counter-intuitive. It's a bit strange, as
it caps the number of batch pointers, while it ignores the buffers that
are ~1000x larger.

I experimented with capping the nbatch by how many pointers fit into
MaxAllocSize (and INT_MAX).

Min(MaxAllocSize / sizeof(void *), INT_MAX / 2 + 1);

But I think this does not really matter much in practice. First, this
only happens with low work_mem values, while systems doing large joins
tend to have work_mem increased. Second, this means the nbatch is set
too low, and it'll get "fixed" by the memory balancing at runtime.

So I think we don't need to do anything about this.

[1]: /messages/by-id/244dc6c1-3b3d-4de2-b3de-b1511e6a6d10@vondra.me

--
Tomas Vondra