Fix overflow of nbatch
Hi Everyone,
With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is an int, and when multiplied by BLCKSZ it can
easily overflow the int limit. To keep the calculation of
current_space safe, convert nbatch to size_t.
Please find attached a patch for the same.
Thanks,
Vaibhav
Attachments:
0001-Fix-overflow-of-nbatch.patch (application/octet-stream, +1/-1)
On Tue, 23 Sept 2025 at 01:57, Vaibhav Jain <jainva@google.com> wrote:
With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is an int, and when multiplied by BLCKSZ it can
easily overflow the int limit. To keep the calculation of
current_space safe, convert nbatch to size_t.
Thanks for finding and reporting this.
I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:
size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
I'd rather see this fixed by adding a cast, i.e.: "(size_t) nbatch" on
the above two lines. All that code is new in a1b4f289b, and if you
change the nbatch to a size_t it affects the code that existed before
a1b4f289b, and from looking over that code, it seems to ensure that
nbatch does not overflow int by doing "dbatch = Min(dbatch,
max_pointers);", where max_pointers is constrained by MaxAllocSize /
sizeof(HashJoinTuple).
Also, making nbatch a size_t makes the final line, "*numbatches =
nbatch;", somewhat questionable since numbatches is an int pointer.
David
On 9/22/25 22:45, David Rowley wrote:
On Tue, 23 Sept 2025 at 01:57, Vaibhav Jain <jainva@google.com> wrote:
With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is an int, and when multiplied by BLCKSZ it can
easily overflow the int limit. To keep the calculation of
current_space safe, convert nbatch to size_t.
Thanks for finding and reporting this.
I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:
size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
Yeah, I failed to notice this part of the formula can overflow.
I'd rather see this fixed by adding a cast, i.e.: "(size_t) nbatch" on
the above two lines. All that code is new in a1b4f289b, and if you
change the nbatch to a size_t it affects the code that existed before
a1b4f289b, and from looking over that code, it seems to ensure that
nbatch does not overflow int by doing "dbatch = Min(dbatch,
max_pointers);", where max_pointers is constrained by MaxAllocSize /
sizeof(HashJoinTuple).
Also, making nbatch a size_t makes the final line, "*numbatches =
nbatch;", somewhat questionable since numbatches is an int pointer.
It seems cleaner to add the cast to the formula, if only because of
the assignment to numbatches.
thanks
--
Tomas Vondra
On Sep 22, 2025, at 21:20, Vaibhav Jain <jainva@google.com> wrote:
Hi Everyone,
With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is an int, and when multiplied by BLCKSZ it can
easily overflow the int limit. To keep the calculation of
current_space safe, convert nbatch to size_t.
Please find attached a patch for the same.
Thanks,
Vaibhav
<0001-Fix-overflow-of-nbatch.patch>
I guess that's because earlier in the function, nbatch is always clamped with:
nbatch = pg_nextpower2_32(Max(2, minbatch));
So, in practice, nbatch won't grow very big. But yes, if nbatch reaches, say, 1 million, it will overflow.
A simple program shows that changing nbatch to size_t prevents the overflow:
```
#include <stdio.h>
int main(void)
{
    size_t nbatch = 1000000;        /* 1 million */
    int    BLCKSZ = 8192;
    size_t result = 2 * nbatch * BLCKSZ;

    printf("%zu\n", result);        /* prints 16384000000 */
    return 0;
}
```
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Tue, 23 Sept 2025 at 11:21, Chao Li <li.evan.chao@gmail.com> wrote:
I guess that's because earlier in the function, nbatch is always clamped with:
nbatch = pg_nextpower2_32(Max(2, minbatch));
I don't follow which part of that line could be constituted as
clamping. Maybe you've confused Max with Min?
David
On Tue, 23 Sept 2025 at 09:31, Tomas Vondra <tomas@vondra.me> wrote:
On 9/22/25 22:45, David Rowley wrote:
I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:
size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
Yeah, I failed to notice this part of the formula can overflow.
Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?
David
On 9/23/25 02:02, David Rowley wrote:
On Tue, 23 Sept 2025 at 09:31, Tomas Vondra <tomas@vondra.me> wrote:
On 9/22/25 22:45, David Rowley wrote:
I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:
size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
Yeah, I failed to notice this part of the formula can overflow.
Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?
Feel free to fix, but I can take care of it once 18 is out the door.
It's my bug, after all.
BTW ExecHashIncreaseBatchSize needs the same fix, I think.
I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
8KB blocks), which is a lot. But with the balancing logic, it'd also
mean each batch is about ~2GB. So the whole "hash table" would be about
500GB. Possible, but unlikely.
However, looking at the code now, I think the code should adjust the
hash_table_bytes value, not just space_allowed. It's meant to be the
same thing. Will check tomorrow.
thanks
--
Tomas Vondra
On Tue, 23 Sept 2025 at 13:01, Tomas Vondra <tomas@vondra.me> wrote:
On 9/23/25 02:02, David Rowley wrote:
Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?
Feel free to fix, but I can take care of it once 18 is out the door.
It's my bug, after all.
BTW ExecHashIncreaseBatchSize needs the same fix, I think.
I think it's probably best you handle this. I didn't notice that one.
You know this area much better than I do.
I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
8KB blocks), which is a lot. But with the balancing logic, it'd also
mean each batch is about ~2GB. So the whole "hash table" would be about
500GB. Possible, but unlikely.
I think that no matter how low the chances of overflow are, the code
isn't written the way it was intended, so it should be fixed regardless
of how likely the overflow is. Otherwise, we'll just end up with
harder-to-hit bugs which could take much longer to [re]discover. Also,
in these terms, what's unlikely today may not be in the future.
David
On Sep 23, 2025, at 07:35, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 23 Sept 2025 at 11:21, Chao Li <li.evan.chao@gmail.com> wrote:
I guess that's because earlier in the function, nbatch is always clamped with:
nbatch = pg_nextpower2_32(Max(2, minbatch));
I don't follow which part of that line could be constituted as
clamping. Maybe you've confused Max with Min?
David
Sorry for the confusion. I actually meant "minbatch".
I have traced through the function several times. First, with a normal
query (not much data involved), the condition
if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
{
is hard to meet, so nbatch will just be 1.
With big data involved, it will enter the "if" clause, but minbatch
still rarely goes very high.
To clarify, I just created a test:
```
evantest=# SET enable_nestloop = off;
SET
evantest=# SET enable_mergejoin = off;
SET
evantest=# SET enable_hashjoin = on;
SET
evantest=# CREATE TEMP TABLE inner_tbl AS
evantest-# SELECT g AS id, repeat('x', 2000) AS filler
evantest-# FROM generate_series(1, 200000) g;
SELECT 200000
evantest=# CREATE TEMP TABLE outer_tbl AS
evantest-# SELECT g AS id FROM generate_series(1, 1000000) g;
SELECT 1000000
evantest=#
evantest=#
evantest=# EXPLAIN ANALYZE
evantest-# SELECT *
evantest-# FROM outer_tbl o
evantest-# JOIN inner_tbl i
evantest-# ON o.id = i.id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=34647.00..1270978635.52 rows=36306020352 width=40) (actual time=353.908..1355.735 rows=200000.00 loops=1)
Hash Cond: (i.id = o.id)
Buffers: local read=54528 dirtied=54425 written=54418, temp read=45853 written=45853
-> Seq Scan on inner_tbl i (cost=0.00..113608.96 rows=6356096 width=36) (actual time=1.132..460.711 rows=200000.00 loops=1)
Buffers: local read=50048 dirtied=50000 written=49993
-> Hash (cost=15904.00..15904.00 rows=1142400 width=4) (actual time=351.280..351.282 rows=1000000.00 loops=1)
Buckets: 262144 Batches: 8 Memory Usage: 6446kB
Buffers: local read=4480 dirtied=4425 written=4425, temp written=2560
-> Seq Scan on outer_tbl o (cost=0.00..15904.00 rows=1142400 width=4) (actual time=0.760..162.229 rows=1000000.00 loops=1)
Buffers: local read=4480 dirtied=4425 written=4425
Planning:
Buffers: shared hit=14
Planning Time: 389649.420 ms
Execution Time: 1362.392 ms
(14 rows)
```
In this test, minbatch is just 64.
But I agree, I never tested with a large amount of data. I don't
actually know how much data it takes to make nbatch reach ~130K (the
value that will overflow if nbatch is an int).
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Thank you everyone for the reviews, feedback and the test.
Please find attached a new version of the patch.
On Tue, Sep 23, 2025 at 7:11 AM Chao Li <li.evan.chao@gmail.com> wrote:
Attachments:
0002-Fix-overflow-of-nbatch.patch (application/octet-stream, +3/-3)
On 9/23/25 03:20, David Rowley wrote:
On Tue, 23 Sept 2025 at 13:01, Tomas Vondra <tomas@vondra.me> wrote:
On 9/23/25 02:02, David Rowley wrote:
Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?
Feel free to fix, but I can take care of it once 18 is out the door.
It's my bug, after all.
BTW ExecHashIncreaseBatchSize needs the same fix, I think.
I think it's probably best you handle this. I didn't notice that one.
You know this area much better than I do.
OK, will do.
I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
8KB blocks), which is a lot. But with the balancing logic, it'd also
mean each batch is about ~2GB. So the whole "hash table" would be about
500GB. Possible, but unlikely.
I think that no matter how low the chances of overflow are, the code
isn't written the way it was intended, so it should be fixed regardless
of how likely the overflow is. Otherwise, we'll just end up with
harder-to-hit bugs which could take much longer to [re]discover. Also,
in these terms, what's unlikely today may not be in the future.
I wasn't disputing the validity of the bug. I was just thinking aloud
about how likely it is to hit.
regards
--
Tomas Vondra
Hi,
I kept looking at this, and unfortunately it seems a bit worse :-(
Fixing the overflow is easy enough - adding the cast does the trick.
But then there's the second issue I mentioned - the loop does not adjust
the hash_table_bytes. It updates the *space_allowed, but that's not what
the current_space/new_space formulas use.
This breaks the "balancing" as the nbatch gets decreased until
nbatch <= (work_mem / BLCKSZ)
while the hash table "space_allowed" is increasing. This may result in
an *increased* memory usage :-(
I also noticed the code does not clamp nbuckets properly as it should.
The question is what to do about this. If we'd gotten this a week ago,
I'd probably just revert a1b4f289, and then try again for PG19. After
all, the issue it was meant to address is somewhat rare.
But with 18 already stamped ...
I've shared these findings with the rest of the RMT, I'll see what their
thoughts are. Of course, other opinions/suggestions are welcome.
regards
--
Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes:
The question is what to do about this. If we'd gotten this a week ago,
I'd probably just revert a1b4f289, and then try again for PG19. After
all, the issue it was meant to address is somewhat rare.
But with 18 already stamped ...
18.0 is what it is. If this is the most serious bug in it, I'll be
a bit surprised. Take your time, fix it properly.
regards, tom lane
On Tue, Sep 23, 2025 at 11:34:56AM -0400, Tom Lane wrote:
18.0 is what it is. If this is the most serious bug in it, I'll be
a bit surprised. Take your time, fix it properly.
+1
--
nathan
On 23/09/2025 6:11 PM, Tomas Vondra wrote:
Hi,
I kept looking at this, and unfortunately it seems a bit worse :-(
Fixing the overflow is easy enough - adding the cast does the trick.
But then there's the second issue I mentioned - the loop does not adjust
the hash_table_bytes. It updates the *space_allowed, but that's not what
the current_space/new_space formulas use.
This breaks the "balancing" as the nbatch gets decreased until
nbatch <= (work_mem / BLCKSZ)
while the hash table "space_allowed" is increasing. This may result in
an *increased* memory usage :-(
I also noticed the code does not clamp nbuckets properly as it should.
The question what to do about this. If we got this a week ago, I'd just
probably just revert a1b4f289, and then try again for PG19. After all,
the issue it was meant to address is somewhat rare.
But with 18 already stamped ...
I've shared these findings with the rest of the RMT, I'll see what their
thoughts are. Of course, other opinions/suggestions are welcome.
regards
Hi Tomas,
If you are going to investigate this problem more, can you also look at
the related problem:
/messages/by-id/52b94d5b-a135-489d-9833-2991a69ec623@garret.ru
I proposed the patch but got no feedback.
Best regards,
Konstantin
On 9/23/25 21:13, Konstantin Knizhnik wrote:
On 23/09/2025 6:11 PM, Tomas Vondra wrote:
Hi,
I kept looking at this, and unfortunately it seems a bit worse :-(
Fixing the overflow is easy enough - adding the cast does the trick.
But then there's the second issue I mentioned - the loop does not adjust
the hash_table_bytes. It updates the *space_allowed, but that's not what
the current_space/new_space formulas use.
This breaks the "balancing" as the nbatch gets decreased until
nbatch <= (work_mem / BLCKSZ)
while the hash table "space_allowed" is increasing. This may result in
an *increased* memory usage :-(
I also noticed the code does not clamp nbuckets properly as it should.
The question what to do about this. If we got this a week ago, I'd just
probably just revert a1b4f289, and then try again for PG19. After all,
the issue it was meant to address is somewhat rare.
But with 18 already stamped ...
I've shared these findings with the rest of the RMT, I'll see what their
thoughts are. Of course, other opinions/suggestions are welcome.
regards
Hi Tomas,
If you are going to investigate this problem more, can you also look at
the related problem:
/messages/by-id/52b94d5b-a135-489d-9833-2991a69ec623%40garret.ru#ebe4151f1d505bbcc32cf93b2e8a1936
I proposed the patch but got no feedback.
Thanks, I'll take a look. I wasn't aware of that thread (or rather I
didn't realize it might have been related to this). And AFAICS the
max_batches business is separate.
But the last bit seems very relevant:
- while (nbatch > 0)
+ while (nbatch > 0 &&
+        nbuckets * 2 <= max_pointers) /* prevent allocation limit overflow */
I believe this is the missing "nbuckets clamping" that I mentioned above.
But I think it needs to work a bit differently. The max_pointers is
calculated from work_mem, but the whole point here is to grow the hash
table beyond that. I think it makes sense to relax that limit, and allow
up to MaxAllocSize, or something like that.
regards
--
Tomas Vondra
Hi,
Here's a couple draft patches fixing the bug:
- 0001 adds the missing size_t cast, to fix the overflow
- 0002 fixes the balancing, by adjusting the hash table size limit
- 0003 adds the missing overflow protection for nbuckets and the hash
table limit
- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple months, I found it overly verbose / not very clear.
I'm sure it could be improved even further.
0001 and 0002 are pretty trivial, 0003 is a bit bigger, but most of the
code is simply how we clamp nbuckets elsewhere (more or less).
At some point I started wondering if it'd have been better to use the
algebraic solution posted by James Hunter back in February [1]. It'd
not need the loop, but it'd still need all this new overflow
protection etc.
I wanted to make sure the patches actually make it work correctly, so I
created a table with 4B rows:
create table t (a bigint, b text);
insert into t select i, md5(i::text)
from generate_series(1,4000000000) s(i);
and I added this log message at the end of ExecChooseHashTableSize:
elog(WARNING, "wm %d nbatch %d nbucket %d hash %ld total %ld",
work_mem, nbatch, nbuckets, (*space_allowed)/1024,
(*space_allowed + 2 * nbatch * (Size) BLCKSZ)/1024);
and I ran an explain on a self-join
set enable_mergejoin = off;
set max_parallel_workers_per_gather = 0;
set work_mem = '...';
explain select * from t t1 join t t2 on (t1.a = t2.a);
with work_mem set to values between 64kB and 1GB.
On 18.0 I got this:
wm 64 nbatch 8 nbucket 2097152 hash 131072 total 131200
wm 128 nbatch 16 nbucket 4194304 hash 262144 total 262400
wm 256 nbatch 32 nbucket 8388608 hash 524288 total 524800
wm 512 nbatch 64 nbucket 16777216 hash 1048576 total 1049600
wm 1024 nbatch 128 nbucket 33554432 hash 2097152 total 2099200
wm 2048 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
wm 4096 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 8192 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 16384 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
I wanted to know how serious the issue is, compared to what would happen
without the balancing. I disabled the balancing (by skipping the loop),
and then I get this:
wm 64 nbatch 8192 nbucket 2048 hash 128 total 131200
wm 128 nbatch 16384 nbucket 4096 hash 256 total 262400
wm 256 nbatch 32768 nbucket 8192 hash 512 total 524800
wm 512 nbatch 65536 nbucket 16384 hash 1024 total 1049600
wm 1024 nbatch 131072 nbucket 32768 hash 2048 total 2099200
wm 2048 nbatch 131072 nbucket 65536 hash 4096 total 2101248
wm 4096 nbatch 65536 nbucket 131072 hash 8192 total 1056768
wm 8192 nbatch 32768 nbucket 262144 hash 16384 total 540672
wm 16384 nbatch 16384 nbucket 524288 hash 32768 total 294912
wm 32768 nbatch 8192 nbucket 1048576 hash 65536 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
The interesting bit is that the expected total memory usage (the last
number in the log line) is exactly the same as for 18.0 with and without
balancing. IIUC this is due to the "stop" condition using the initial
hash table size. It makes me a bit less worried about this triggering
OOM crashes - it does not improve the behavior, but it doesn't use more
memory than before. Still an embarrassing bug, though.
With the attached patches, this looks like this:
wm 64 nbatch 256 nbucket 65536 hash 4096 total 8192
wm 128 nbatch 512 nbucket 131072 hash 8192 total 16384
wm 256 nbatch 1024 nbucket 262144 hash 16384 total 32768
wm 512 nbatch 2048 nbucket 524288 hash 32768 total 65536
wm 1024 nbatch 4096 nbucket 1048576 hash 65536 total 131072
wm 2048 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 4096 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 8192 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 16384 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
So, this time it actually seems to work correctly and significantly
reduces the memory usage ...
There's one weird thing remaining - if you look at nbatch, it actually
increases for the first couple work_mem steps. That's weird, because
after increasing work_mem we should need *fewer* batches. But this has
nothing to do with the balancing, it happens even with it disabled.
The reason is that when calculating nbatch we do this:
dbatch = Min(dbatch, max_pointers);
and max_pointers is calculated from work_mem (among other things). It's
a bit funny the logic worries about how many batch pointers we have,
and refuses to allow more. But at the same time it ignores the BufFiles.
AFAICS it's harmless - we may pick a low number of batches initially, but
then later we'll ramp it up (and the balancing will work too). And if
you choose to run huge hash joins with tiny work_mem, I guess you're in
for the suffering anyway. In any case, it's unrelated to balancing.
regards
[1]: /messages/by-id/CAJVSvF6290rJF2MtgSx_SuT9Kn2amZ_+zecoZYMU+dn3BVVaZg@mail.gmail.com
--
Tomas Vondra
Attachments:
vfix-0004-reword-balancing-comment.patch (text/x-patch, +21/-44)
vfix-0003-nbuckets-overflow-protection.patch (text/x-patch, +40/-5)
vfix-0002-adjust-hash_table_bytes-for-balancing.patch (text/x-patch, +1/-1)
vfix-0001-fix-size-overflow.patch (text/x-patch, +5/-4)
On Wed, Sep 24, 2025 at 7:02 PM Tomas Vondra <tomas@vondra.me> wrote:
Here's a couple draft patches fixing the bug:
- 0001 adds the missing size_t cast, to fix the overflow
- 0002 fixes the balancing, by adjusting the hash table size limit
- 0003 adds the missing overflow protection for nbuckets and the hash
table limit
I've attached an alternative patch idea. It's shorter because I
avoided multiplication using algebraic equivalence. There are two
overflow checks, one for nbuckets and one for whether max_pointers can
be allocated, but otherwise it avoids most of the overflow risk by
switching to division.
On the topic of max_pointers, I think there is no need to handle >
MaxAllocSize. We were willing to cope with the original values of
nbatch and nbuckets, we are just trying to optimize that. If doing so
would make us run out of memory for the arrays of pointers to buckets,
just use the last hashtable size that didn't have this problem. That
means we don't have to do any shenanigans to have powers of two for
nbuckets because we don't do clamping.
Also, I think the outer loop needs the condition nbatch > 1, not nbatch
> 0 -- when nbatch is 1, once we divide it by 2, it would end up as 0.
I'm still waiting for the 4-billion row table to be created on my
machine, so I haven't verified that I get the same results as you yet.
- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple months, I found it overly verbose / not very clear.
I'm sure it could be improved even further.
My attached patch does build on your revised wording.
- Melanie
Attachments:
alt-Fix-overflow-risk-in-hashtable-size-calculation.patch (text/x-patch, +47/-62)
On 10/7/25 23:10, Melanie Plageman wrote:
On Wed, Sep 24, 2025 at 7:02 PM Tomas Vondra <tomas@vondra.me> wrote:
Here's a couple draft patches fixing the bug:
- 0001 adds the missing size_t cast, to fix the overflow
- 0002 fixes the balancing, by adjusting the hash table size limit
- 0003 adds the missing overflow protection for nbuckets and the hash
table limit
I've attached an alternative patch idea. It's shorter because I
avoided multiplication using algebraic equivalence. There are two
overflow checks, one for nbuckets and one for whether max_pointers can
be allocated, but otherwise it avoids most of the overflow risk by
switching to division.
Thanks for looking!
On the topic of max_pointers, I think there is no need to handle >
MaxAllocSize. We were willing to cope with the original values of
nbatch and nbuckets, we are just trying to optimize that. If doing so
would make us run out of memory for the arrays of pointers to buckets,
just use the last hashtable size that didn't have this problem. That
means we don't have to do any shenanigans to have powers of two for
nbuckets because we don't do clamping.
Hmm, so if I understand correctly you suggest stopping when nbuckets
gets too high, while my code allowed reducing nbatch further (and just
capped nbatch). I'm fine with this change if it makes the code
simpler. It means we allow ~130M buckets, which seems rather unlikely
to be a problem in practice. That means 1GB for buckets alone, and
tuples tend to be a multiple of that. With 4GB total, that's ~256k
batches. And multiplied by the 130M that would be 3.5e13 tuples ...
However, I don't understand why the condition bothers about INT_MAX?
/* Ensure that nbuckets * 2 doesn't overflow an int */
if (nbuckets > INT_MAX / 2)
break;
AFAIK what matters is whether the buckets fit into MaxAllocSize. But
INT_MAX (or even INT_MAX/2) would allocate way more than that. The
original code (copied from an earlier part of the function) points out
the INT_MAX check is redundant, given the current MaxAllocSize value.
It's however true that if we double nbuckets and nbatch at the same
time, we don't need to bother doing this:
max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
because we can never break. Although, is that really true? When picking
the initial nbuckets value we do this:
nbuckets = Max(nbuckets, 1024);
What if this increased the nbuckets (it'd require extremely low work_mem
of course)? Could it lead to unexpectedly high nbuckets in the loop?
Similarly, why does this check care about number of buckets?
if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;
I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
(hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
size of the hash table in bytes, not in number of items. Also, see how
get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
continue doing that.
I like the idea of simplifying the stop condition, instead of
calculating the old/new size, and comparing those. But is this actually
correct?
if (nbatch / 4 < hash_table_bytes / BLCKSZ)
break;
If we start with the "full" condition
hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)
subtract hash_bytes from both sides
(2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)
subtract (nbatch * BLCKSZ)
nbatch * BLCKSZ < hash_bytes
that gives us
nbatch < (hash_bytes / BLCKSZ)
So where did the /4 come from? Also, maybe it'd be easier to just do
(nbatch * BLCKSZ) < hash_bytes
i.e. without the division.
Also, I think the outer loop needs the condition nbatch > 1, not nbatch
> 0 -- when nbatch is 1, once we divide it by 2, it would end up as 0.
Good point. I was wondering why we're not seeing any failures because of
this (e.g. from tests using tiny amounts of data, with a single batch).
But we immediately hit the stop condition, because the two batch files
use much less than the minimum work_mem value (even without the
multiplier).
So it's benign, but let's not do that anyway.
I'm still waiting for the 4-billion row table to be created on my
machine, so I haven't verified that I get the same results as you yet.
- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple months, I found it overly verbose / not very clear.
I'm sure it could be improved even further.
My attached patch does build on your revised wording.
Thanks. I'll re-read the comment tomorrow.
regards
--
Tomas Vondra
On Tue, Oct 7, 2025 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:
On 10/7/25 23:10, Melanie Plageman wrote:
Hmm, so if I understand correctly you suggest stopping when nbuckets
gets too high, while my code allowed reducing nbatches further (and
just capped nbatches). I'm fine with this change if it makes the code
simpler. That means we allow ~130M buckets, which seems rather unlikely
to be a problem in practice. That means 1GB for buckets alone, and
tuples tend to be a multiple of that. With 4GB total, that's ~256k
batches. And multiplied by the 130M that would be 3.5e13 tuples ...

However, I don't understand why the condition bothers about INT_MAX?
/* Ensure that nbuckets * 2 doesn't overflow an int */
if (nbuckets > INT_MAX / 2)
break;
You're right. This can just be
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;
It's however true that if we double nbuckets and nbatch at the same
time, we don't need to bother doing this:

max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
because we can never break. Although, is that really true? When picking
the initial nbuckets value we do this:

nbuckets = Max(nbuckets, 1024);
What if this increased the nbuckets (it'd require extremely low work_mem
of course)? Could it lead to unexpectedly high nbuckets in the loop?
If we are worried about nbuckets exceeding what can be allocated, then
I think the proposed condition above takes care of that
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;
As for whether or not bumping nbuckets to 1024 before the loop means
we can have very high values with low work_mem: it seems like even
with the minimum work_mem, the number of buckets is larger than that.
When could we hit this? Maybe if the number of skew buckets is very
large?
Similarly, why does this check care about number of buckets?
if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;

I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
(hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
size of the hash table in bytes, not in number of items. Also, see how
get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
continue doing that.
Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:
hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize / sizeof(HashJoinTuple) / 2
but I don't think we need this, because nbuckets should always be at
least hash_table_bytes / sizeof(HashJoinTuple): to start out with, it
was clamped to Max(1024, hash_table_bytes / sizeof(HashJoinTuple)).
The only exception would be if MaxAllocSize / sizeof(HashJoinTuple)
was smaller than hash_table_bytes / sizeof(HashJoinTuple) in which
case, checking nbuckets > MaxAllocSize / sizeof(HashJoinTuple) should
cover that anyway.
I like the idea of simplifying the stop condition, instead of
calculating the old/new size, and comparing those. But is this actually
correct?

if (nbatch / 4 < hash_table_bytes / BLCKSZ)
break;

If we start with the "full" condition
hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)
subtract hash_bytes from both sides
(2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)
subtract (nbatch * BLCKSZ)
nbatch * BLCKSZ < hash_bytes
that gives us
nbatch < (hash_bytes / BLCKSZ)
So where did the /4 come from?
Yes, I made the mistake of doubling the batches and hash_table_bytes
again, forgetting that the original formula was already comparing the
hypothetical space to the current space.
I think it should be
if (nbatch < hash_table_bytes / BLCKSZ)
as you say
Also, maybe it'd be easier to just do
(nbatch * BLCKSZ) < hash_bytes

i.e. without the division.
I prefer the division to avoid as many potentially overflow-causing
operations as possible. Otherwise we would have to check that nbatch *
BLCKSZ doesn't overflow first.
I'm still waiting for the 4-billion row table to be created on my
machine, so I haven't verified that I get the same results as you yet.
I have updated my patch to fix the mistakes above. I also noticed then
that I wasn't doubling space_allowed in the loop but instead setting
it to hash_table_bytes at the end. This doesn't produce a power of 2
because we subtract skew_mcvs from the hash_table_bytes. So, we have
to keep using space_allowed if we want a power of 2 in the end.
I've changed my patch to do this, but this made me wonder if we want
to be doing this or instead take hash_table_bytes at the end and round
it up to a power of 2 and set space_allowed to that. If the skew
hashtable is large, we may be allocating way more space_allowed than
we need for new hash_table_bytes + skew hashtable buckets.
Oh, and I get the same logging output results as your patch with attached v2.
- Melanie