Adjusting hash join memory limit to handle batch explosion
Hi,
I've been once again reminded of the batch explosion issue in hashjoin,
due to how it enforces the memory limit. This resurfaces every now and
then, when a user gets strange OOM issues - see for example these
threads from ~2019, and even some patches: [1][2][3]
Let me restart the discussion, resubmit some of the older patches, and
present a plan for what to do about this ...
Just to remind the basic details, a brief summary - the hashjoin does
not account for the spill files when enforcing the memory limit. The
hash table gets full, it decides to double the number of batches which
cuts the hash table size in half. But with enough batches the doubling
can actually make the situation much worse - the new batches simply use
more memory than was saved.
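A back-of-envelope sketch of that trade-off (assuming, per the BufFile discussion later in the thread, that each batch costs two 8kB file buffers - one inner, one outer; the function name is mine):

```python
# Doubling nbatch from n to 2n saves at most half the in-memory hash
# table (work_mem / 2), but adds 2 * n * 8kB of new BufFile buffers.
BUFFILE_KB = 8

def doubling_makes_it_worse(work_mem_kb, nbatch):
    saved_kb = work_mem_kb / 2              # hash table cut in half
    added_kb = 2 * nbatch * BUFFILE_KB      # new inner + outer batch files
    return added_kb > saved_kb

print(doubling_makes_it_worse(4096, 64))     # False: doubling still pays off
print(doubling_makes_it_worse(4096, 1024))   # True: new files outweigh savings
```

So with work_mem=4MB the break-even point is already at 128 batches; beyond that, every doubling is a net loss.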
This can happen for various reasons. A simple example is that we
underestimate the size of the input relation, so the hash table needs to
be built on many more tuples. This is bad, but usually not disastrous.
It's much worse when there's a batch that is not "divisible", i.e.
adding more batches does not split it roughly in half. This can happen
due to hash collisions (in the part that determines the batch), or
duplicate values that didn't make it into the MCV list (and thus the
skew optimization does not kick in).
This is fairly rare, but when it happens it can easily lead to batch
explosion, i.e. rapidly increasing the number of batches. We add
batches, but the batch does not split, so we promptly hit the limit
again, triggering another increase. It often stops only when we exhaust
the 32-bit hash space, ending with 100s of thousands of batches.
Attached is a SQL script that reproduces something like this. It builds
a table with values with hashes that have 0s in the upper bits. And then
the hash join just spirals into a batch explosion.
Note: The script is a bit dumb and needs a lot of temp space (~50GB)
when generating the values with duplicate hashes.
In 2019 I shared a bunch of patches [4] improving this, but then I got
distracted and the discussion stalled because there were proposals to
fix this by introducing a special hash join "mode" to address these
issues [5], but we never got past a prototype and there's a lot of
outstanding questions.
So I decided to revisit the three patches from 2019. Attached are
rebased and cleaned up versions. A couple comments on each one:
1) v20241231-adjust-limit-0001-Account-for-batch-files-in-ha.patch
I believe this is the way to go, for now. The basic idea is to keep the
overall behavior, but "relax" the memory limit as the number of batches
increases to minimize the total memory use.
This may seem a bit weird, but as the number of batches grows there's no
way to not violate the limit. And the current code simply ignores this
and allocates arbitrary amounts of memory.
2) v20241231-single-spill-0001-Hash-join-with-a-single-spill.patch
The basic idea is that we keep only a small "slice" of batches in
memory, and data for later batches are spilled into a single file. This
means that even if the number of batches increases, the memory use does
not change. Which means the memory limit is enforced very strictly.
The problem is this performs *terribly* because it shuffles data many
times, always just to the next slice. So if we happen to have 128
batches in memory and the number explodes to ~128k batches, we end up
reading/writing each tuple ~500x.
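That ~500x figure follows from simple counting (using the 128-batch slice and ~128k batches mentioned above, and assuming tuples are spread uniformly over batches):

```python
# "Single spill" scheme: only slice_size batches are in memory at a time;
# everything destined for a later batch goes into one overflow file that
# is re-read and re-spilled as the window advances one slice at a time.
nbatch = 128 * 1024        # after a batch explosion
slice_size = 128
nslices = nbatch // slice_size             # 1024 slice windows
# A tuple belonging to slice k gets rewritten once for every window
# before k; averaged over uniformly distributed tuples that's:
avg_rewrites = (nslices - 1) / 2
print(nslices, avg_rewrites)               # 1024 511.5
```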
3) v20241231-multi-spill-0001-Hash-join-with-a-multiple-spil.patch
This is an improvement of the "single spill", except that we keep one
spill file per slice, which greatly reduces the amount of temporary
traffic. The trouble is this means we can no longer enforce the memory
limit that strictly, because the number of files does grow with the
number of batches, although not 1:1. But with a slice of 128 batches we
get only 1 file per 128 batches, which is a nice reduction.
This means that ultimately it's either (1) or (3), and the more I've
been looking into this the more I prefer (1), for a couple reasons:
* It's much simpler (it doesn't really change anything in the basic
behavior, doesn't introduce any new files or anything like that).
* There doesn't seem to be major difference in total memory consumption
between the two approaches. Both allow allocating more memory.
* It actually helps with the "indivisible batch" case - it relaxes the
limit, so there's a chance the batch eventually fits and we stop adding
more and more batches. With spill files that's not the case - we still
keep the original limit, and we end up with the batch explosion (but
then we handle it much more efficiently).
Unless there are some objections, my plan is to get (1) cleaned up and
try to get it in for 18, possibly in the January CF. It's not a
particularly complex patch, and it already passes check-world (it only
affected three plans in join_hash, and those make sense I think).
One thing I'm not sure about yet is whether this needs to tweak the
hashjoin costing to also consider the files when deciding how many
batches to use. Maybe it should?
regards
[1]: /messages/by-id/20190504003414.bulcbnge3rhwhcsh@development
[2]: /messages/by-id/20230228190643.1e368315@karst
[3]: /messages/by-id/bc138e9f-c89e-9147-5395-61d51a757b3b@gusw.net
[4]: /messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development
[5]: /messages/by-id/CAAKRu_YsWm7gc_b2nBGWFPE6wuhdOLfc1LBZ786DUzaCPUDXCA@mail.gmail.com
--
Tomas Vondra
Attachments:
v20241231-adjust-limit-0001-Account-for-batch-files-in-ha.patch (+142, -20)
v20241231-multi-spill-0001-Hash-join-with-a-multiple-spil.patch (+300, -53)
v20241231-single-spill-0001-Hash-join-with-a-single-spill.patch (+250, -53)
Hi,
I kept thinking about this, thinking about alternative approaches, and
also about how hashjoin limits memory in general.
First, I want to discuss one thing I tried, but I think it does not
really work. The annoying part about the "memory rebalance" patch is
that it relaxes the memory limit. However, the memory limit is a lie,
because we enforce it by adding batches, and that unfortunately is not
free - each batch is a BufFile with BLCKSZ buffer, so while we may
succeed in keeping the hash table in work_mem, we may end up with the
batches using gigabytes of memory - which is not reported in EXPLAIN,
but it's still allocated. This happens even without a batch explosion;
the explosion is just an extreme version of the problem.
There's no way to work around this ... as long as we use BufFiles. What
if we used plain File(s), without the buffering? Then the per-batch cost
would be considerably lower. Of course, this would be acceptable only if
not having the buffering has acceptable impact on performance. That
seemed unlikely, but I decided to give it a try - see a PoC of this in
the attached files-poc-patches.tgz (patch 0001).
Unfortunately, the impact does not seem acceptable - it does enforce the
limit, but the lack of buffering does make a huge difference, making it
~2x slower depending on the query.
I experimented a bit with a cross-file buffer, much smaller than the sum
of BufFile buffers, but still allowing combining writes into larger
chunks. Imagine you have a buffer large enough for (nbatch * 2) tuples,
then we may expect writing two tuples at once into each batch file.
The 0002 patch in the PoC series tries to do this, but it does not
really help, and in some cases it's doing even worse than 0001, because
the cost of maintaining the shared buffer increases with the number of
batches. I'm sure some of this is my fault, and it could be improved and
optimized quite a bit.
I decided not to do that, because this experiment made me realize that:
a) The buffer would need to grow with the number of batches, to have any
chance of combining the writes. If we want to combine K tuples into a
single write (on average), we'd need the buffer to keep (nbatches * K)
tuples, and once the nbatches gets large (because who cares about
BufFile allocations with a handful of batches), that may be a lot of
memory. Say we get to 32k batches (which is not that hard), and we want
to keep 16 tuples, each 128B, that's ~64MB. Not a huge amount, and much
less than the 512MB we'd need for batches. But still, it means we're not
really enforcing the memory limit - which was the point of using files
without buffering.
b) It does enforce the limit on the hash table itself, though. And that
is actually not great, because it means it can't possibly help with the
batch explosion, caused by a single "indivisible" batch.
c) It's pretty invasive.
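The arithmetic in (a), spelled out (the 32k batches, 16 tuples per write and 128B tuples are the numbers from the paragraph; variable names are mine):

```python
# To combine K tuples per write across nbatch files, the shared buffer
# must hold roughly nbatch * K tuples.
nbatch = 32 * 1024
K = 16                       # tuples combined into each write
tuple_bytes = 128
buffer_mb = nbatch * K * tuple_bytes // (1024 * 1024)   # shared buffer
# versus keeping per-batch BufFiles (2 per batch, 8kB buffer each):
buffile_mb = 2 * nbatch * 8 // 1024
print(buffer_mb, buffile_mb)    # 64 512
```

Much less than the BufFile overhead, but still far from a fixed memory limit.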
So I still think adjusting the memory as we're adding batches seems like
a better approach. The question is where to do the adjustments, based on
what logic ...
I think the general idea and formula explained in [1] are right, but
while working on the PoC patch I started to think about how to formalize
this. And I ended up creating two tables that I think visualize it
pretty nicely.
Imagine a table (in the spreadsheet sense), with work_mem values in rows
and nbatch values in columns. And the cell is "total memory" used to
execute such hash join, i.e.
work_mem + (2 * nbatches * 8K)
(Yes, there's a multiplier for the hash table size, but I use work_mem
for simplicity.) This is what the two attached PDF files show,
highlighting two interesting patterns, so let's talk about that.
1) hash-memory-model-1.pdf
Imagine you're executing a hash join - you're in a particular cell of
the table. And we've reached the current memory limit, i.e. we've filled
the hash table, and need to do something. The only solution is to
"expand" the expected "total hash size" (nbatch * hash_table_size),
which we do by simply doubling the number of batches. And often that's
the right thing to do.
For example, let's say we're running with work_mem=4MB and nbatch=16,
we've filled the hash table and are using 4336kB of memory (a little bit
more than work_mem). If we double the number of batches, we may use up
to 4352kB of memory in the next cycle. And that's fine.
But hey, there's another way to double the "total hash size" - allowing
the in-memory hash table to be twice as large. In the above case, that
would be wrong, because doubling work_mem would use up to 8432kB.
So in this case it's clearly right to double the number of batches,
because that minimizes the total memory used in the next step.
However, consider for example the cell with work_mem=4MB, nbatch=8192.
We're using 135MB of memory, and need to decide what to do. Doubling the
batches means we'll use up to 266MB. But doubling work_mem increases the
memory use only to 139MB.
This is what the green/red in the table means. Green means "better to
double nbatch" while red is "better to double work_mem". And clearly,
the table is split into two regions, separated by the diagonal.
The diagonal is the "optimal" path - if you start in any cell, the
red/green decisions will get you to the diagonal, and then along it.
The patch [1] aims to do this, but I think this visual explanation is
much clearer than anything in that patch.
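The green/red rule can be sketched directly from the total-memory formula above (work_mem + 2 * nbatch * 8kB; the function names are mine):

```python
# Total memory for a hash join: in-memory hash table (capped at
# work_mem) plus two 8kB BufFiles per batch.
def total_kb(work_mem_kb, nbatch):
    return work_mem_kb + 2 * nbatch * 8

def better_to_double_batches(work_mem_kb, nbatch):
    # Green cell: doubling nbatch yields lower total memory than
    # doubling work_mem. Red cell: the opposite.
    return total_kb(work_mem_kb, 2 * nbatch) < total_kb(2 * work_mem_kb, nbatch)

# work_mem=4MB, nbatch=16: green, doubling batches is cheaper.
print(better_to_double_batches(4096, 16))      # True
# work_mem=4MB, nbatch=8192: red, doubling work_mem wins
# (266MB vs. 139MB in the example above).
print(better_to_double_batches(4096, 8192))    # False
```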
2) hash-memory-model-2.pdf
I've also asked if maybe the patch should do something about the choice
of initial nbatch value, which gets me to the second PDF.
Imagine we know the total amount of data in the Hash node is 1GB. There
are different ways to split that into batches. If we have enough memory,
we could do hash join without batching. With work_mem=1MB we'll need to
split this into 1024 batches, or we might do work_mem=2MB with only 512
batches. And so on - we're moving along the anti-diagonal.
The point is that while this changes the work_mem, this can have pretty
non-intuitive impact on total memory use. For example, with wm=1MB we
actually use 17MB of memory, while with wm=2MB we use only 10MB.
But each anti-diagonal has a minimum - the value on the diagonal. I
believe this is the "optimal starting cell" for the hash join. If we
don't pick this, the rules explained in (1) will eventually get us to
the diagonal anyway.
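The anti-diagonal walk can be reproduced with the same formula (1GB of hashed data, as in the example above):

```python
# For a fixed 1GB of data, halving work_mem doubles nbatch, and total
# memory is work_mem plus the file overhead (2 * nbatch * 8kB).
data_kb = 1024 * 1024      # 1GB to hash

def total_kb(work_mem_kb):
    nbatch = data_kb // work_mem_kb
    return work_mem_kb + 2 * nbatch * 8

totals = {wm: total_kb(wm) for wm in (1024, 2048, 4096, 8192, 16384)}
# wm=1MB -> 17408kB (~17MB), wm=2MB -> 10240kB (10MB), and the minimum
# sits on the diagonal at wm=4MB, where work_mem equals the file overhead.
print(totals)
```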
A different visualization is in the attached SVG, which is a surface
plot / heat map of the total memory use. It shows that there really is a
"valley" of minimal values on the diagonal, and that the growth for
doubling batches is much steeper than for doubling work_mem.
Attached is an "adjust-size" patch implementing this. In the end it has
pretty much the same effect as the patch in [1], except that it's much
simpler - everything important happens in just two simple blocks, one in
ExecChooseHashTableSize(), the other in ExecHashIncreaseNumBatches().
There's a bit of complexity, because if we allow growing the size of the
in-memory hash table, we probably need to allow increasing the number of
buckets. But that's not possible with how we split the hashvalue now, so
the patch adjusts that by reversing the hashvalue bits when calculating
the batch. I'm not sure if this is the best way to do this, there might
well be a better solution.
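To illustrate the bit-reversal idea, here is a hypothetical Python sketch (not the patch's actual C code; function names and constants are mine):

```python
# Sketch: bucketno keeps using the low bits as before, but batchno is
# taken from the bit-reversed hash. Growing nbuckets then never shifts
# which bits batchno consumes, and doubling nbatch still sends a tuple
# either to its old batch or to old + nbatch.
def reverse32(x):
    return int('{:032b}'.format(x)[::-1], 2)

def bucket_and_batch(hashvalue, nbuckets, nbatch):
    # nbuckets and nbatch are assumed to be powers of two
    bucketno = hashvalue & (nbuckets - 1)
    batchno = reverse32(hashvalue) & (nbatch - 1)
    return bucketno, batchno

h = 0xDEADBEEF
# Doubling nbuckets leaves batchno untouched:
assert bucket_and_batch(h, 1024, 256)[1] == bucket_and_batch(h, 2048, 256)[1]
# Doubling nbatch moves the tuple to its old batch or old + 256 only:
old = bucket_and_batch(h, 1024, 256)[1]
new = bucket_and_batch(h, 1024, 512)[1]
assert new in (old, old + 256)
```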
I admit all of this seemed a bit weird / wrong initially, because it
feels like giving up on the memory limit. But the simple truth is that the
memory limit is pretty much just a lie - the fact that we only show the
hash table size in EXPLAIN does not mean we're not using gigabytes more
memory, we're just not making it clear. So I'd argue this actually does
a better job in limiting memory usage.
When thinking about reasons why doubling the work_mem might not be the
right thing, I can think of one case - CPU caches. IIRC it may be much
faster to do the lookups if the hash table is small enough to fit into L3,
and doubling it might work against that goal, although I'm not sure how
bad the impact may be. In the batch explosion case it surely doesn't
matter - the cost of spilling/loading many files is much higher. But for
regular (well estimated) cases it might have negative impact.
This is why the patch only adjusts the initial parameters in the "red"
area, not in the green. Maybe it should be a bit more conservative and
only kick in when the nbatch value is above some threshold.
I'd appreciate opinions and alternative ideas about this.
I'm also attaching the data + SQL script I use to trigger the batch
explosion with up to 2M batches.
regards
[1]: /messages/by-id/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f@vondra.me
--
Tomas Vondra
Hi Tomas,
Thanks for working on this. I haven't studied this problem recently,
but here are some ideas that occur to me:
1. Try to reduce the per-batch overhead.
2. Stop increasing the number of batches when the per-batch overhead
exceeds a small percentage of work_mem (10%? 5%? 1%?).
If you've reached a point where the per-batch overhead is using up
>=10% of your work_mem, then at the next doubling it's going to be
using >=20%, which is pretty insane, and the next doubling after that
is going to be >=40%, which is really silly. For 1MB of work_mem and
what I gather from your remarks is 16kB/batch, we exceed the 10%
threshold at 16 batches. Somebody might claim that capping the number
of batches to 16 is insane, but if those 16 batches are using 256kB of
memory and we're supposed to finish the entire operation using <= 1MB
of memory, it really isn't. We pretty obviously are not going to be
able to stay within 1MB no matter what we do.
I think your proposal might be a more refined version of this, where
instead of just completely ceasing to create new batches, you try to
balance creating new batches with overrunning work_mem to get the best
outcome possible overall. Maybe that's a good approach, although
perhaps it is more complicated than we need? At any rate, I found the
adjust-size patch to be quite hard to understand. I think if you
want to go that route it would need more comments and to have the
existing ones rewritten so that they are understandable without
needing to scour this email thread (e.g. "Try to move on the
anti-diagonal and see if we'd consume less memory" doesn't seem like
something most people are going to understand without a lot of
context).
...Robert
On 1/6/25 16:42, Robert Haas wrote:
> Hi Tomas,
>
> Thanks for working on this. I haven't studied this problem recently,
> but here are some ideas that occur to me:
>
> 1. Try to reduce the per-batch overhead.
Yeah. The "use files without buffering" approach may be seen as an
extreme version of this, but it didn't perform well enough. The "shared"
buffering was an attempt to have a buffer that doesn't need to scale
linearly with the number of batches, but that has issues too (I'm sure
some of that is due to my faults in the PoC patch).
I wonder if maybe a better solution would be to allow BufFiles with
smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much
that helps, before the buffering stops being effective as the buffer
gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer
in half for every nbatch doubling, we'd be down to 1B after 13 rounds
(but the buffer is useless once it gets too small to hold multiple
tuples, it's only like 5 cycles).
Maybe it'd still work well enough if we only did that for large nbatch
values, and ensured the buffer can't get too small (say, less than 1kB).
But that only gives 3 doubling cycles - i.e. instead of 8GB of memory
we'd only use 1GB. That's an improvement, but also not very different
from what the "balancing" achieves, except that it's way more invasive
and complex.
> 2. Stop increasing the number of batches when the per-batch overhead
> exceeds a small percentage of work_mem (10%? 5%? 1%?).
>
> If you've reached a point where the per-batch overhead is using up
> >=10% of your work_mem, then at the next doubling it's going to be
> using >=20%, which is pretty insane, and the next doubling after that
> is going to be >=40%, which is really silly. For 1MB of work_mem and
> what I gather from your remarks is 16kB/batch, we exceed the 10%
> threshold at 16 batches. Somebody might claim that capping the number
> of batches to 16 is insane, but if those 16 batches are using 256kB of
> memory and we're supposed to finish the entire operation using <= 1MB
> of memory, it really isn't. We pretty obviously are not going to be
> able to stay within 1MB no matter what we do.
Agreed.
> I think your proposal might be a more refined version of this, where
> instead of just completely ceasing to create new batches, you try to
> balance creating new batches with overrunning work_mem to get the best
> outcome possible overall. Maybe that's a good approach, although
> perhaps it is more complicated than we need? At any rate, I found the
> adjust-size patch to be quite hard to understand. I think if you
> want to go that route it would need more comments and to have the
> existing ones rewritten so that they are understandable without
> needing to scour this email thread (e.g. "Try to move on the
> anti-diagonal and see if we'd consume less memory" doesn't seem like
> something most people are going to understand without a lot of
> context).
Yes, the proposal does essentially this. And you're certainly right some
of the comments are hard to understand without reading some of the
thread, so that would need to improve.
regards
--
Tomas Vondra
On Mon, Jan 6, 2025 at 11:51 AM Tomas Vondra <tomas@vondra.me> wrote:
> I wonder if maybe a better solution would be to allow BufFiles with
> smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much
> that helps, before the buffering stops being effective as the buffer
> gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer
> in half for every nbatch doubling, we'd be down to 1B after 13 rounds
> (but the buffer is useless once it gets too small to hold multiple
> tuples, it's only like 5 cycles).
I was more thinking about whether we need to keep all of those buffers
around all the time. If the number of batches doesn't increase, then
after we finish moving things into batches they should never need to
be moved into a different batch. If it does, then things are
different, but for example if we initially plan on 64 batches and then
later decide we need 256 batches, we should really only need 3 buffers
at a time, except for the initial work during batch 0. (In this
example, a tuple that is initially assigned to batch 1 might need to
be moved to batch 65, 129, or 193, but it can't need to go anywhere
else.)
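The claim in that parenthetical can be checked mechanically (treating batch assignment as hash mod nbatch, which is the model the example uses):

```python
# With nbatch going from 64 to 256, a tuple whose hash put it in
# batch 1 (hash % 64 == 1) can only land in batch 1, 65, 129, or 193
# under the new modulus, i.e. at most 3 *new* destinations.
destinations = {h % 256 for h in range(1, 1 << 16, 64)}  # all h with h % 64 == 1
print(sorted(destinations))   # [1, 65, 129, 193]
```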
But I don't quite know how we could avoid needing all the buffers at
once during batch 0. That said, it's questionable whether it really
makes sense to have an initial number of batches that is very large.
Does partitioning the input data into 64k batches really make sense,
or would it be more efficient to partition it 256 ways initially and
then do a second pass over each of those to split them up another 256
ways? It's a lot more I/O, but trying to split 64k ways at once is
presumably going to thrash the File table as well as do a lot of
completely random physical I/O, so maybe it's worth considering.
Another thought is that, if we really do want to partition 64k ways
all at once, maybe 16kb set aside for each batch is not the right
approach. 64k batches * 16kB/buffer = 1GB, but if we have 1GB of
memory available for partitioning, wouldn't it make sense to read a
gigabyte of tuples, sort them by batch #, and then open each file that
needs to get at least 1 tuple, write all the tuples into that file,
and close it? This seems more scalable than what we do today, because
it doesn't require a certain amount of memory per batch. The
performance might not be great if you're using a really small amount
of memory for a really large number of batches, but it might still be
better than the current algorithm, which could easily leave a lot of
that memory idling a lot of the time.
Said another way, I think the current algorithm is optimized for small
numbers of batches. Repeatedly filling and flushing a 16kB buffer
makes sense if the number of buffers isn't that big so that flushes
are regular and a buffer is typically going to spend a lot of its time
approximately half full. But when the number of batches becomes large,
buffers will start to be flushed less and less often, especially if
there is skew in the data but to some degree even if there isn't. Any
buffer that sits there for "a long time" -- whatever that means
exactly -- without getting flushed is not a good use of memory.
I'm just spitballing here. Don't confuse anything in this email with a
demand for you to do something different than you are.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 1/6/25 19:50, Robert Haas wrote:
> On Mon, Jan 6, 2025 at 11:51 AM Tomas Vondra <tomas@vondra.me> wrote:
>> I wonder if maybe a better solution would be to allow BufFiles with
>> smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much
>> that helps, before the buffering stops being effective as the buffer
>> gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer
>> in half for every nbatch doubling, we'd be down to 1B after 13 rounds
>> (but the buffer is useless once it gets too small to hold multiple
>> tuples, it's only like 5 cycles).
>
> I was more thinking about whether we need to keep all of those buffers
> around all the time. If the number of batches doesn't increase, then
> after we finish moving things into batches they should never need to
> be moved into a different batch. If it does, then things are
> different, but for example if we initially plan on 64 batches and then
> later decide we need 256 batches, we should really only need 3 buffers
> at a time, except for the initial work during batch 0. (In this
> example, a tuple that is initially assigned to batch 1 might need to
> be moved to batch 65, 129, or 193, but it can't need to go anywhere
> else.)
Right.
> But I don't quite know how we could avoid needing all the buffers at
> once during batch 0. That said, it's questionable whether it really
> makes sense to have an initial number of batches that is very large.
> Does partitioning the input data into 64k batches really make sense,
> or would it be more efficient to partition it 256 ways initially and
> then do a second pass over each of those to split them up another 256
> ways? It's a lot more I/O, but trying to split 64k ways at once is
> presumably going to thrash the File table as well as do a lot of
> completely random physical I/O, so maybe it's worth considering.
True, but as soon as you limit the number of batches you could generate,
isn't that more or less the same as not enforcing the limit on the amount
of memory consumed by the hash table? Because you have to keep the
tuples that belong to the current batch in memory ...
I suppose you could do this recursively, i.e. split into 256 batches, and
once you can't keep the current batch in memory, spill it to disk too. And
then read it from the file, and split it into 256 more batches. I think we'd
need to remember the minimum nbatch value for each batch (when it was
created), and then go through all the stages up to current nbatch. But
it could work, I guess.
The thing is - I don't think increasing the work_mem is bad - in fact,
it's exactly the thing that may stop the batch explosion when there are
hash collisions / overlaps. That's what the test script does to trigger
the explosion, although I admit it's an artificial / adversarial case. But
similar stuff can happen in PROD, and we blindly increase nbatch when it
can't possibly help, stopping only after running out of hash bits.
> Another thought is that, if we really do want to partition 64k ways
> all at once, maybe 16kb set aside for each batch is not the right
> approach. 64k batches * 16kB/buffer = 1GB, but if we have 1GB of
> memory available for partitioning, wouldn't it make sense to read a
> gigabyte of tuples, sort them by batch #, and then open each file that
> needs to get at least 1 tuple, write all the tuples into that file,
> and close it? This seems more scalable than what we do today, because
> it doesn't require a certain amount of memory per batch. The
> performance might not be great if you're using a really small amount
> of memory for a really large number of batches, but it might still be
> better than the current algorithm, which could easily leave a lot of
> that memory idling a lot of the time.
This is pretty much the idea behind the 0002 patch in the "raw files"
PoC patch, although I tried to use a much smaller batch. Maybe with 1GB
(and better coding than in my PoC patch) it would work better.
Still, if we have 1GB for a buffer, maybe it'd be better to use some of
that for a larger hash table, and not need that many batches ...
> Said another way, I think the current algorithm is optimized for small
> numbers of batches. Repeatedly filling and flushing a 16kB buffer
> makes sense if the number of buffers isn't that big so that flushes
> are regular and a buffer is typically going to spend a lot of its time
> approximately half full. But when the number of batches becomes large,
> buffers will start to be flushed less and less often, especially if
> there is skew in the data but to some degree even if there isn't. Any
> buffer that sits there for "a long time" -- whatever that means
> exactly -- without getting flushed is not a good use of memory.
Right. FWIW I suspect we had similar discussions for the hashagg.
> I'm just spitballing here. Don't confuse anything in this email with a
> demand for you to do something different than you are.
No, thanks. It's good to have these discussions and be forced to think
about it from a different angle.
regards
--
Tomas Vondra
On Tue, Dec 31, 2024 at 6:07 PM Tomas Vondra <tomas@vondra.me> wrote:
> So I decided to revisit the three patches from 2019. Attached are
> rebased and cleaned up versions. A couple comments on each one:
>
> 1) v20241231-adjust-limit-0001-Account-for-batch-files-in-ha.patch
>
> I believe this is the way to go, for now. The basic idea is to keep the
> overall behavior, but "relax" the memory limit as the number of batches
> increases to minimize the total memory use.
>
> This may seem a bit weird, but as the number of batches grows there's no
> way to not violate the limit. And the current code simply ignores this
> and allocates arbitrary amounts of memory.
I'm just catching up on this thread and haven't read all the mails
yet. I started with looking at the patches in the first email and got
a bit confused.
In this patch (v20241231-adjust-limit-0001-Account-for-batch-files-in-ha.patch),
I see that you've started accounting for the spill files in
hashtable->spaceUsed -- in the same way that is done for the tuples in
the hashtable. I know the other memory contexts (hashCxt and batchCxt)
in hashtable aren't appropriate for figuring out spaceUsed, but I was
wondering if the hashtable->spillCxt accurately reflects how much
memory is being used for these spill files at one time? Perhaps it
doesn't make sense to use this, but when we added it in 8c4040edf45, I
thought we might one day be able to use it for determining peak space
usage. Or perhaps you are imagining it only being used for observability?
- Melanie
On 1/9/25 17:17, Melanie Plageman wrote:
> On Tue, Dec 31, 2024 at 6:07 PM Tomas Vondra <tomas@vondra.me> wrote:
>> So I decided to revisit the three patches from 2019. Attached are
>> rebased and cleaned up versions. A couple comments on each one:
>>
>> 1) v20241231-adjust-limit-0001-Account-for-batch-files-in-ha.patch
>>
>> I believe this is the way to go, for now. The basic idea is to keep the
>> overall behavior, but "relax" the memory limit as the number of batches
>> increases to minimize the total memory use.
>>
>> This may seem a bit weird, but as the number of batches grows there's no
>> way to not violate the limit. And the current code simply ignores this
>> and allocates arbitrary amounts of memory.
>
> I'm just catching up on this thread and haven't read all the mails
> yet. I started with looking at the patches in the first email and got
> a bit confused.
>
> In this patch (v20241231-adjust-limit-0001-Account-for-batch-files-in-ha.patch),
> I see that you've started accounting for the spill files in
> hashtable->spaceUsed -- in the same way that is done for the tuples in
> the hashtable. I know the other memory contexts (hashCxt and batchCxt)
> in hashtable aren't appropriate for figuring out spaceUsed, but I was
> wondering if the hashtable->spillCxt accurately reflects how much
> memory is being used for these spill files at one time? Perhaps it
> doesn't make sense to use this, but when we added it in 8c4040edf45, I
> thought we might one day be able to use it for determining peak space
> usage. Or perhaps you are imagining it only being used for observability?
Good question. Yes, the patch from 12/31 does look at all the memory,
including the batch files. I thought about using the spillCxt too, but I
don't think it would work because the context tracks *current* memory
usage, and we're interested in how much memory would be used *after*
doubling the number of batches.
regards
--
Tomas Vondra
On Tue, Dec 31, 2024 at 6:07 PM Tomas Vondra <tomas@vondra.me> wrote:
> This means that ultimately it's either (1) or (3), and the more I've
> been looking into this the more I prefer (1), for a couple reasons:
>
> * It's much simpler (it doesn't really change anything in the basic
> behavior, doesn't introduce any new files or anything like that).
>
> * There doesn't seem to be major difference in total memory consumption
> between the two approaches. Both allow allocating more memory.
>
> * It actually helps with the "indivisible batch" case - it relaxes the
> limit, so there's a chance the batch eventually fits and we stop adding
> more and more batches. With spill files that's not the case - we still
> keep the original limit, and we end up with the batch explosion (but
> then we handle it much more efficiently).
Okay, I've read all the patches proposed in this mail and most of the
downthread ideas, and I want to cast my vote for option 1.
I find the design of 3 too complicated for what it buys us.
The slices make it harder to understand how the system works. The
current 1-1 relationship in master between batches and spill files is
easier to reason about. With the slices, I'm imagining trying to
understand why we, for example, have to move tuples from batch 4 just
because batch 5 got too big for the hashtable.
I think something like this might be worth it if it solved the problem
entirely, but if it is just a somewhat better coping mechanism, I
don't think it is worth it.
I was excited about your raw file experiment. As Robert and you point
out -- we may need a file per batch, but for most of the hash join's
execution we don't need to keep buffers for each batch around.
However, given that the experiment didn't yield great results and we
haven't come up with an alternative solution with sufficiently few
flaws, I'm still in favor of 1.
The part of 1 I struggled to understand is the math in
ExecHashExceededMemoryLimit(). I think the other email you sent with
the charts and diagonals is about choosing the optimal hashtable size
and number of batches (when to stop growing the number of batches and
just increase the size of the hashtable). So, I'll dive into that.
One thing I'm not sure about yet is whether this needs to tweak the
hashjoin costing to also consider the files when deciding how many
batches to use. Maybe it should?
I think it definitely should. The ExecChooseHashTableSize()
calculations look similar to what we use to calculate spaceAllowed, so
it makes sense that we would consider buffile sizes if we are counting
those in spaceUsed now.
- Melanie
On Sun, Jan 5, 2025 at 10:00 PM Tomas Vondra <tomas@vondra.me> wrote:
I think the general idea and formula explained in [1] is right, but
while working on the PoC patch I started to think about how to formalize
this. And I ended up creating two tables that I think visualize it
pretty nicely.

Imagine a table (in the spreadsheet sense), with work_mem values in rows
and nbatch values in columns. And the cell is "total memory" used to
execute such a hash join, i.e.

    work_mem + (2 * nbatches * 8K)
(Yes, there's a multiplier for the hash table size, but I use work_mem
for simplicity.) This is what the two attached PDF files show,
highlighting two interesting patterns, so let's talk about that.

1) hash-memory-model-1.pdf
Imagine you're executing a hash join - you're in a particular cell of
the table. And we've reached the current memory limit, i.e. we've filled
the hash table, and need to do something. The only solution is to
"expand" the expected "total hash size" (nbatch * hash_table_size),
which we do by simply doubling the number of batches. And often that's
the right thing to do.

For example, let's say we're running with work_mem=4MB and nbatch=16,
we've filled the hash table and are using 4336kB of memory (a little bit
more than work_mem). If we double the number of batches, we may use up
to 4352kB of memory in the next cycle. And that's fine.
If you double the number of batches, isn't that an additional 32 files
with 8kB each -- so 256kB more memory (not 16 kB)?
But hey, there's another way to double the "total hash size" - allowing
the in-memory hash table to be twice as large. In the above case, that
would be wrong, because doubling work_mem would use up to 8432kB.

So in this case it's clearly right to double the number of batches,
because that minimizes the total memory used in the next step.

However, consider for example the cell with work_mem=4MB, nbatch=8192.
We're using 135MB of memory, and need to decide what to do. Doubling the
batches means we'll use up to 266MB. But doubling work_mem increases the
memory use only to 139MB.
Right, it makes sense to use this as the basis for deciding whether or
not to increase nbatches.
This is what the green/red in the table means. Green means "better to
double nbatch" while red is "better to double work_mem". And clearly,
the table is split into two regions, separated by the diagonal.

The diagonal is the "optimal" path - if you start in any cell, the
red/green decisions will get you to the diagonal, and then along it.

The patch [1] aims to do this, but I think this visual explanation is
much clearer than anything in that patch.
Yes, the visual is great, thanks!
2) hash-memory-model-2.pdf
I've also asked if maybe the patch should do something about the choice
of initial nbatch value, which gets me to the second PDF.

Imagine we know the total amount of data in the Hash node is 1GB. There
are different ways to split that into batches. If we have enough memory,
we could do hash join without batching. With work_mem=1MB we'll need to
split this into 1024 batches, or we might do work_mem=2MB with only 512
batches. And so on - we're moving along the anti-diagonal.

The point is that while this changes the work_mem, this can have pretty
non-intuitive impact on total memory use. For example, with wm=1MB we
actually use 17MB of memory, while with wm=2MB we use only 10MB.

But each anti-diagonal has a minimum - the value on the diagonal. I
believe this is the "optimal starting cell" for the hash join. If we
don't pick this, the rules explained in (1) will eventually get us to
the diagonal anyway.
Makes sense.
Attached is an "adjust-size" patch implementing this. In the end it has
pretty much the same effect as the patch in [1], except that it's much
simpler - everything important happens in just two simple blocks, one in
ExecChooseHashTableSize(), the other in ExecHashIncreaseNumBatches().
It's interesting -- since the new patch no longer needs to count
buffile overhead in spaceUsed, spacePeak won't include that overhead.
And ultimately EXPLAIN uses the spacePeak, right?
There's a bit of complexity, because if we allow growing the size of the
in-memory hash table, we probably need to allow increasing number of
buckets. But that's not possible with how we split the hashvalue now, so
the patch adjusts that by reversing the hashvalue bits when calculating
the batch. I'm not sure if this is the best way to do this, there might
well be a better solution.
This part is pretty unpleasant looking (reverse_byte array in the
code). I'll try and think of different ideas. However, I wonder what
other kinds of effects allowing increasing the number of buckets
during execution might have?
I admit all of this seemed a bit weird / wrong initially, because it
feels like giving up the memory limit. But the simple truth is that
memory limit is pretty much just a lie - the fact that we only show the
hash table size in EXPLAIN does not mean we're not using gigabytes more
memory, we're just not making it clear. So I'd argue this actually does
a better job in limiting memory usage.
I guess people can multiply the number of batches * 8kB to get that
extra memory overhead. Maybe we should consider putting that in
EXPLAIN output?
When thinking about reasons why doubling the work_mem might not be the
right thing, I can think of one case - CPU caches. IIRC it may be much
faster to do the lookups if the hash is small enough to fit into L3,
and doubling this might work against the goal, although I'm not sure how
bad the impact may be. In the batch explosion case it surely doesn't
matter - the cost of spilling/loading many files is much higher. But for
regular (well estimated) cases it might have negative impact.

This is why the patch only adjusts the initial parameters in the "red"
area, not in the green. Maybe it should be a bit more conservative and
only kick in when the nbatch value is above some threshold.
Wait isn't that the opposite of what you are saying? That is, if we
want to keep the hashtable fitting in L3, wouldn't we want to allow
increasing the number of batches even if it uses more memory? That is
the green area. I see your patch does the red -- increase hashtable
size and decrease nbatches if it is better. But that seems
inconsistent with your point about making the hashtable fit in L3.
I'd appreciate opinions and alternative ideas about this.
I really like the overall idea about being principled in the number of
batches vs hashtable size. I think the question of how to increase the
number of buckets during execution is an important one to figure out.
- Melanie
On 1/9/25 23:18, Melanie Plageman wrote:
On Sun, Jan 5, 2025 at 10:00 PM Tomas Vondra <tomas@vondra.me> wrote:
I think the general idea and formula explained in [1] is right, but
while working on the PoC patch I started to think about how to formalize
this. And I ended up creating two tables that I think visualize it
pretty nicely.

Imagine a table (in the spreadsheet sense), with work_mem values in rows
and nbatch values in columns. And the cell is "total memory" used to
execute such a hash join, i.e.

    work_mem + (2 * nbatches * 8K)
(Yes, there's a multiplier for the hash table size, but I use work_mem
for simplicity.) This is what the two attached PDF files show,
highlighting two interesting patterns, so let's talk about that.

1) hash-memory-model-1.pdf
Imagine you're executing a hash join - you're in a particular cell of
the table. And we've reached the current memory limit, i.e. we've filled
the hash table, and need to do something. The only solution is to
"expand" the expected "total hash size" (nbatch * hash_table_size),
which we do by simply doubling the number of batches. And often that's
the right thing to do.

For example, let's say we're running with work_mem=4MB and nbatch=16,
we've filled the hash table and are using 4336kB of memory (a little bit
more than work_mem). If we double the number of batches, we may use up
to 4352kB of memory in the next cycle. And that's fine.

If you double the number of batches, isn't that an additional 32 files
with 8kB each -- so 256kB more memory (not 16 kB)?
Right, I think that's a typo in my message, not sure where I got the
4352kB. The table has a correct value 4592kB.
But hey, there's another way to double the "total hash size" - allowing
the in-memory hash table to be twice as large. In the above case, that
would be wrong, because doubling work_mem would use up to 8432kB.

So in this case it's clearly right to double the number of batches,
because that minimizes the total memory used in the next step.

However, consider for example the cell with work_mem=4MB, nbatch=8192.
We're using 135MB of memory, and need to decide what to do. Doubling the
batches means we'll use up to 266MB. But doubling work_mem increases the
memory use only to 139MB.

Right, it makes sense to use this as the basis for deciding whether or
not to increase nbatches.

This is what the green/red in the table means. Green means "better to
double nbatch" while red is "better to double work_mem". And clearly,
the table is split into two regions, separated by the diagonal.

The diagonal is the "optimal" path - if you start in any cell, the
red/green decisions will get you to the diagonal, and then along it.

The patch [1] aims to do this, but I think this visual explanation is
much clearer than anything in that patch.

Yes, the visual is great, thanks!
Glad you find it useful too.
2) hash-memory-model-2.pdf
I've also asked if maybe the patch should do something about the choice
of initial nbatch value, which gets me to the second PDF.

Imagine we know the total amount of data in the Hash node is 1GB. There
are different ways to split that into batches. If we have enough memory,
we could do hash join without batching. With work_mem=1MB we'll need to
split this into 1024 batches, or we might do work_mem=2MB with only 512
batches. And so on - we're moving along the anti-diagonal.

The point is that while this changes the work_mem, this can have pretty
non-intuitive impact on total memory use. For example, with wm=1MB we
actually use 17MB of memory, while with wm=2MB we use only 10MB.

But each anti-diagonal has a minimum - the value on the diagonal. I
believe this is the "optimal starting cell" for the hash join. If we
don't pick this, the rules explained in (1) will eventually get us to
the diagonal anyway.

Makes sense.
Attached is an "adjust-size" patch implementing this. In the end it has
pretty much the same effect as the patch in [1], except that it's much
simpler - everything important happens in just two simple blocks, one in
ExecChooseHashTableSize(), the other in ExecHashIncreaseNumBatches().

It's interesting -- since the new patch no longer needs to count
buffile overhead in spaceUsed, spacePeak won't include that overhead.
And ultimately EXPLAIN uses the spacePeak, right?
Right. I think this is a good point - I think it was actually helpful
that the initial patch made this extra memory visible in EXPLAIN. But
without the changes to spacePeak that's no longer the case, so maybe we
should add a separate field or something like that ...
There's a bit of complexity, because if we allow growing the size of the
in-memory hash table, we probably need to allow increasing number of
buckets. But that's not possible with how we split the hashvalue now, so
the patch adjusts that by reversing the hashvalue bits when calculating
the batch. I'm not sure if this is the best way to do this, there might
well be a better solution.

This part is pretty unpleasant looking (reverse_byte array in the
code). I'll try and think of different ideas. However, I wonder what
other kinds of effects allowing increasing the number of buckets
during execution might have?
Agreed. It's simply the simplest approach to make the hashing work, I
haven't even tried to measure the overhead. I was looking for some
built-in function to reverse bits etc. but found nothing.
I admit all of this seemed a bit weird / wrong initially, because it
feels like giving up the memory limit. But the simple truth is that
memory limit is pretty much just a lie - the fact that we only show the
hash table size in EXPLAIN does not mean we're not using gigabytes more
memory, we're just not making it clear. So I'd argue this actually does
a better job in limiting memory usage.

I guess people can multiply the number of batches * 8kB to get that
extra memory overhead. Maybe we should consider putting that in
EXPLAIN output?
Exactly what I suggested above (to add that to EXPLAIN).
Expecting people to realize the batches are backed by batch files and
multiply the number by 8kB didn't quite work, I think. People don't
realize each file has a 8kB buffer, and experienced users/hackers are
surprised by how much memory it quietly consumes.
When thinking about reasons why doubling the work_mem might not be the
right thing, I can think of one case - CPU caches. IIRC it may be much
faster to do the lookups if the hash is small enough to fit into L3,
and doubling this might work against the goal, although I'm not sure how
bad the impact may be. In the batch explosion case it surely doesn't
matter - the cost of spilling/loading many files is much higher. But for
regular (well estimated) cases it might have negative impact.

This is why the patch only adjusts the initial parameters in the "red"
area, not in the green. Maybe it should be a bit more conservative and
only kick in when the nbatch value is above some threshold.

Wait isn't that the opposite of what you are saying? That is, if we
want to keep the hashtable fitting in L3, wouldn't we want to allow
increasing the number of batches even if it uses more memory? That is
the green area. I see your patch does the red -- increase hashtable
size and decrease nbatches if it is better. But that seems
inconsistent with your point about making the hashtable fit in L3.
You're right. The "red" area means that we double work_mem, so the hash
table would probably exceed the L3. I don't recall what exactly was my
reasoning, maybe I just didn't think it through.
But I think the L3 benefit likely disappears once we exceed some number
of batches, because batching is fairly expensive. (I haven't measured
this yet, but I find it likely.) That'd mean having some threshold (e.g.
1024 batches), and only applying this new balancing when we exceed it,
would be reasonable.
I'd appreciate opinions and alternative ideas about this.
I really like the overall idea about being principled in the number of
batches vs hashtable size. I think the question of how to increase the
number of buckets during execution is an important one to figure out.
Agreed.
regards
--
Tomas Vondra
On 1/9/25 21:42, Melanie Plageman wrote:
On Tue, Dec 31, 2024 at 6:07 PM Tomas Vondra <tomas@vondra.me> wrote:
This means that ultimately it's either (1) or (3), and the more I've
been looking into this the more I prefer (1), for a couple reasons:

* It's much simpler (it doesn't really change anything on the basic
behavior, doesn't introduce any new files or anything like that).

* There doesn't seem to be a major difference in total memory consumption
between the two approaches. Both allow allocating more memory.

* It actually helps with the "indivisible batch" case - it relaxes the
limit, so there's a chance the batch eventually fits and we stop adding
more and more batches. With spill files that's not the case - we still
keep the original limit, and we end up with the batch explosion (but
then we handle it much more efficiently).

Okay, I've read all the patches proposed in this mail and most of the
downthread ideas, and I want to cast my vote for option 1.
I find the design of 3 too complicated for what it buys us.

The slices make it harder to understand how the system works. The
current 1-1 relationship in master between batches and spill files is
easier to reason about. With the slices, I'm imagining trying to
understand why we, for example, have to move tuples from batch 4 just
because batch 5 got too big for the hashtable.

I think something like this might be worth it if it solved the problem
entirely, but if it is just a somewhat better coping mechanism, I
don't think it is worth it.

I was excited about your raw file experiment. As Robert and you point
out -- we may need a file per batch, but for most of the hash join's
execution we don't need to keep buffers for each batch around.
However, given that the experiment didn't yield great results and we
haven't come up with an alternative solution with sufficiently few
flaws, I'm still in favor of 1.
But I think those were two distinct proposals.
My experiment with raw files keeps adding batches just like the current
code (so it might quickly explode to 1M batches) and then keeps feeding
data to 1M files at the same time. This doesn't work, the buffering
clearly helps a lot, and it'd affect all hashjoins, even those with
fewer batches.
Robert's idea kept using buffered files, but limited how many we can
fill at any phase. Say we'd use a limit of 1024 batches, but we actually
need 1M batches. Then we'd do the build in two phases - we'd generate
1024 batches, and then we'd split each of those batches into 1024
smaller batches. The trick (as I understand it) is those batches can't
overlap, so we'd not need more than 1024 batches, which greatly limits
the memory consumption. We could even use a lower limit, derived from
work_mem or something like that.
Of course, this is a more complex change than the "balancing" patch. But
maybe not that much, not sure. For me the main disadvantage is it
doesn't really help with the batch explosion for skewed data sets (or
data with many hash collisions). It can easily happen we blindly
increase nbatch until we use all the bits, and then break the work_mem
limit anyway.
But maybe there's a way to address that - the growthEnabled=false safety
is an unreliable solution, because it requires the whole batch to fall
to either of the new batches. A single tuple breaks that.
What if we instead compared the two new batches, and looked at
how far the split is from 1/2? And if it's very far from 1/2, we'd
either increase work_mem (a bit like the balancing), or disable nbatch
increases (maybe just temporarily).
The part of 1 I struggled to understand is the math in
ExecHashExceededMemoryLimit(). I think the other email you sent with
the charts and diagonals is about choosing the optimal hashtable size
and number of batches (when to stop growing the number of batches and
just increase the size of the hashtable). So, I'll dive into that.
That math is a bit unclear even to me, that patch was written before I
took the time to work out the formulas and visualizations. It works and
makes roughly the right decisions, but with less rigor. So maybe don't
waste too much time trying to understand it.
One thing I'm not sure about yet is whether this needs to tweak the
hashjoin costing to also consider the files when deciding how many
batches to use. Maybe it should?

I think it definitely should. The ExecChooseHashTableSize()
calculations look similar to what we use to calculate spaceAllowed, so
it makes sense that we would consider buffile sizes if we are counting
those in spaceUsed now.
Yeah. I think the flaw is we may not actually know the number of batches
during planning. In the batch explosion example we start with very few
batches - the explosion only happens during execution.
regards
--
Tomas Vondra
On Thu, Jan 9, 2025 at 6:59 PM Tomas Vondra <tomas@vondra.me> wrote:
On 1/9/25 21:42, Melanie Plageman wrote:
I was excited about your raw file experiment. As Robert and you point
out -- we may need a file per batch, but for most of the hash join's
execution we don't need to keep buffers for each batch around.
However, given that the experiment didn't yield great results and we
haven't come up with an alternative solution with sufficiently few
flaws, I'm still in favor of 1.

But I think those were two distinct proposals.
My experiment with raw files keeps adding batches just like the current
code (so it might quickly explode to 1M batches) and then keeps feeding
data to 1M files at the same time. This doesn't work, the buffering
clearly helps a lot, and it'd affect all hashjoins, even those with
fewer batches.
I see.
Robert's idea kept using buffered files, but limited how many we can
fill at any phase. Say we'd use a limit of 1024 batches, but we actually
need 1M batches. Then we'd do the build in two phases - we'd generate
1024 batches, and then we'd split each of those batches into 1024
smaller batches. The trick (as I understand it) is those batches can't
overlap, so we'd not need more than 1024 batches, which greatly limits
the memory consumption. We could even use a lower limit, derived from
work_mem or something like that.
I think this is because we get the batch based on
*batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) &
(nbatch - 1);
And tuples can only spill forward. I think Robert's example is if we
plan for 64 batches and eventually increase to 256 batches, a tuple
assigned to batch 1 could go to 65, 129, or 193 but no other batch --
meaning we would only need 3 files open when processing batch 1. But I
think we would need to do more explicit file flushing and closing and
opening, right? Which maybe doesn't matter when compared to the
overhead of so many more buffers.
Of course, this is a more complex change than the "balancing" patch. But
maybe not that much, not sure. For me the main disadvantage is it
doesn't really help with the batch explosion for skewed data sets (or
data with many hash collisions). It can easily happen we blindly
increase nbatch until we use all the bits, and then break the work_mem
limit anyway.

But maybe there's a way to address that - the growthEnabled=false safety
is an unreliable solution, because it requires the whole batch to fall
to either of the new batches. A single tuple breaks that.

What if we instead compared the two new batches, and looked at
how far the split is from 1/2? And if it's very far from 1/2, we'd
either increase work_mem (a bit like the balancing), or disable nbatch
increases (maybe just temporarily).
Meaning like have some threshold for the number of tuples over the
limit we are? Right now, we decide to increase batches when we
encounter that one tuple that puts us over the limit. So, I could see
it making sense to decide with more foresight. Or we could even keep
track of the amount over the limit we are and increase the number of
batches once we hit that threshold.
This kind of seems like it would circle back to your algorithm for
deciding on the right tradeoff between hashtable size and number of
batches, though.
You could do something like this _and_ do something like close the
files that can't be the target of tuples from the current batch --
which would allow you to tolerate many more batch increases before
doubling the hashtable size is worth it. But it seems like the
algorithm to adapt the hashtable size based on the optimal tradeoff
between hashtable size and number of batches could be done first and
the patch to close files could be done later.
- Melanie
On 1/10/25 15:54, Melanie Plageman wrote:
On Thu, Jan 9, 2025 at 6:59 PM Tomas Vondra <tomas@vondra.me> wrote:
On 1/9/25 21:42, Melanie Plageman wrote:
I was excited about your raw file experiment. As Robert and you point
out -- we may need a file per batch, but for most of the hash join's
execution we don't need to keep buffers for each batch around.
However, given that the experiment didn't yield great results and we
haven't come up with an alternative solution with sufficiently few
flaws, I'm still in favor of 1.

But I think those were two distinct proposals.
My experiment with raw files keeps adding batches just like the current
code (so it might quickly explode to 1M batches) and then keeps feeding
data to 1M files at the same time. This doesn't work, the buffering
clearly helps a lot, and it'd affect all hashjoins, even those with
fewer batches.

I see.
Robert's idea kept using buffered files, but limited how many we can
fill at any phase. Say we'd use a limit of 1024 batches, but we actually
need 1M batches. Then we'd do the build in two phases - we'd generate
1024 batches, and then we'd split each of those batches into 1024
smaller batches. The trick (as I understand it) is those batches can't
overlap, so we'd not need more than 1024 batches, which greatly limits
the memory consumption. We could even use a lower limit, derived from
work_mem or something like that.

I think this is because we get the batch based on
*batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) &
(nbatch - 1);

And tuples can only spill forward. I think Robert's example is if we
plan for 64 batches and eventually increase to 256 batches, a tuple
assigned to batch 1 could go to 65, 129, or 193 but no other batch --
meaning we would only need 3 files open when processing batch 1.
Yes, I think that's why we only need 3 more files when splitting a
batch. The way I explain it is that going from 64 -> 256 adds 2 more
bits to the "batchno" part of the hash value, and one of the four bit
patterns means "stay in the current batch", so 3 new files.
This does remind me of the "one spill file per slice" patch in [1],
although it approaches it from a different angle. My patch defined the
"slice" as batches we can keep in work_mem, while Robert proposed to
decide how many batches we can open (first level of batching), and then
maybe do that recursively if needed. That seems like a fundamentally
more sound approach (indeed, my patch can create too many slices).
[1]: /messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development
But I think we would need to do more explicit file flushing and
closing and opening, right? Which maybe doesn't matter when compared
to the overhead of so many more buffers.
Would it be all that much flushing and closing? Yes, we'd need to flush
and release the buffers (which I don't think BufFiles can do right now,
but let's ignore that for now). But I'd hope the batches are fairly
large (because that's why we expect to generate them), so one more write
should not make a lot of difference on top of the actual batch split.
Chances are it's far cheaper than the extra memory pressure due to
keeping all batches in memory ...
I wonder if it might be a problem that those future batches are "far
apart". I mean, we're splitting batch 1, and the tuple may go to batches
+64, +128 and +192 batches ahead. If we allow larger "jumps" (e.g. from
256 to 64k batches), it'd be even more visible.
For the case of batch explosion I don't think it matters too much - it
will still explode into an absurd number of batches, that doesn't change.
But that's fine, the point is to not cause OOM. Improving this case
would require increasing the work_mem limit (either directly or by
stopping the growth).
For regular cases I think the idea is the limit would be high enough to
not really hit this too often. I mean, how many real-world queries use
more than ~1024 batches? I don't think that's very common.
Of course, this is a more complex change than the "balancing" patch. But
maybe not that much, not sure. For me the main disadvantage is it
doesn't really help with the batch explosion for skewed data sets (or
data with many hash collisions). It can easily happen we blindly
increase nbatch until we use all the bits, and then break the work_mem
limit anyway.

But maybe there's a way to address that - the growthEnabled=false safety
is an unreliable solution, because it requires the whole batch to fall
to either of the new batches. A single tuple breaks that.

What if we instead compared the two new batches, and looked at
how far the split is from 1/2? And if it's very far from 1/2, we'd
either increase work_mem (a bit like the balancing), or disable nbatch
increases (maybe just temporarily).

Meaning like have some threshold for the number of tuples over the
limit we are? Right now, we decide to increase batches when we
encounter that one tuple that puts us over the limit. So, I could see
it making sense to decide with more foresight. Or we could even keep
track of the amount over the limit we are and increase the number of
batches once we hit that threshold.
Not sure I understand. I meant that we disable nbatch growth like this:
    if (nfreed == 0 || nfreed == ninmemory)
    {
        hashtable->growEnabled = false;
    }
which means that it only takes a single tuple that makes it to the other
batch to keep growing. But if 99.9999% tuples went to one of the
batches, increasing nbatch seems pretty futile.
But it goes in the opposite direction too. Imagine a uniform data set
with plenty of distinct values, but correlated / sorted, and each value
having more rows than can fit into a single batch. We'll immediately
disable growth, which is ... not great.
These are somewhat separate / independent issues, but I think having a
concept of "retrying the nbatch growth after a while" would help.
This kind of seems like it would circle back to your algorithm for
deciding on the right tradeoff between hashtable size and number of
batches, though.
Yes, it's about the same general idea, just expressed in a slightly
different way (the growing the work_mem part).
You could do something like this _and_ do something like close the
files that can't be the target of tuples from the current batch --
which would allow you to tolerate many more batch increases before
doubling the hashtable size is worth it. But it seems like the
algorithm to adapt the hashtable size based on the optimal tradeoff
between hashtable size and number of batches could be done first and
the patch to close files could be done later.
Right. I don't think Robert's idea is a complete answer, because it
does not consider the tradeoff that maybe increasing work_mem would be
better. OTOH maybe that's not something the hashjoin should worry about.
The goal is not to optimize the work_mem value, but make sure we don't
use significantly more memory ...
If hashjoin starts to optimize this, why shouldn't the other places
using work_mem do something similar?
regards
--
Tomas Vondra
On Fri, Jan 10, 2025 at 11:18 AM Tomas Vondra <tomas@vondra.me> wrote:
On 1/10/25 15:54, Melanie Plageman wrote:
On Thu, Jan 9, 2025 at 6:59 PM Tomas Vondra <tomas@vondra.me> wrote:
I think this is because we get the batch based on

*batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) &
(nbatch - 1);

And tuples can only spill forward. I think Robert's example is if we
plan for 64 batches and eventually increase to 256 batches, a tuple
assigned to batch 1 could go to 65, 129, or 193 but no other batch --
meaning we would only need 3 files open when processing batch 1.

Yes, I think that's why we only need 3 more files when splitting a
batch. The way I explain it is that going from 64 -> 256 adds 2 more
bits to the "batchno" part of the batch, and one of the patterns means
"current batch", so 3 new files.

This does remind me of the "one spill file per slice" patch in [1],
although it approaches it from a different angle. My patch defined the
"slice" as batches we can keep in work_mem, while Robert proposed to
decide how many batches we can open (first level of batching), and then
maybe do that recursively if needed. That seems like a fundamentally
more sound approach (indeed, my patch can create too many slices).

[1] /messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development

But I think we would need to do more explicit file flushing and
closing and opening, right? Which maybe doesn't matter when compared
to the overhead of so many more buffers.

Would it be all that much flushing and closing? Yes, we'd need to flush
and release the buffers (which I don't think BufFiles can do right now,
but let's ignore that for now). But I'd hope the batches are fairly
large (because that's why we expect to generate them), so one more write
should not make a lot of difference on top of the actual batch split.
Chances are it's far cheaper than the extra memory pressure due to
keeping all batches in memory ...

I wonder if it might be a problem that those future batches are "far
apart". I mean, we're splitting batch 1, and the tuple may go to batches
+64, +128 and +192 batches ahead. If we allow larger "jumps" (e.g. from
256 to 64k batches), it'd be even more visible.
I don't follow. Why would it be a problem if tuples have to go to
batches that are far away in number?
Of course, this is a more complex change than the "balancing" patch. But
maybe not that much, not sure. For me the main disadvantage is it
doesn't really help with the batch explosion for skewed data sets (or
data with many hash collisions). It can easily happen we blindly
increase nbatch until we use all the bits, and then break the work_mem
limit anyway.

But maybe there's a way to address that - the growthEnabled=false safety
is an unreliable solution, because it requires the whole batch to fall
to either of the new batches. A single tuple breaks that.

What if we instead compared the two new batches, and instead looked at
how far the split is from 1/2? And if it's very far from 1/2, we'd
either increase work_mem (a bit like the balancing), or disable nbatch
increases (maybe just temporarily).

Meaning like have some threshold for the number of tuples over the
limit we are? Right now, we decide to increase batches when we
encounter that one tuple that puts us over the limit. So, I could see
it making sense to decide with more foresight. Or we could even keep
track of the amount over the limit we are and increase the number of
batches once we hit that threshold.

Not sure I understand. I meant that we disable nbatch growth like this:
if (nfreed == 0 || nfreed == ninmemory)
{
hashtable->growEnabled = false;
}
Ah, right. I was thinking of the wrong thing.
which means that it only takes a single tuple that makes it to the other
batch to keep growing. But if 99.9999% tuples went to one of the
batches, increasing nbatch seems pretty futile.
Right. Yes, that is unfortunate. You could do a percentage threshold.
Or if we knew how big the biggest batch is, we could decide whether or
not to disable growth based on the size the hashtable would be for
that batch vs the overhead of another doubling of nbatches.
But it goes in the opposite direction too. Imagine a uniform data set
with plenty of distinct values, but correlated / sorted, and each value
having more rows than can fit into a single batch. We'll immediately
disable growth, which is ... not great.

These are somewhat separate / independent issues, but I think having a
concept of "retrying the nbatch growth after a while" would help.
Yes, I think retrying nbatch growth later makes sense in this case. Or
when doubling nbatches wouldn't help split one rogue batch but would
help other big batches.
You could do something like this _and_ do something like close the
files that can't be the target of tuples from the current batch --
which would allow you to tolerate many more batch increases before
doubling the hashtable size is worth it. But it seems like the
algorithm to adapt the hashtable size based on the optimal tradeoff
between hashtable size and number of batches could be done first and
the patch to close files could be done later.

Right. I don't think Robert's idea is a complete answer, because it
does not consider the tradeoff that maybe increasing work_mem would be
better. OTOH maybe that's not something the hashjoin should worry about.
The goal is not to optimize the work_mem value, but make sure we don't
use significantly more memory ...
Well it's also not a complete solution because it doesn't solve the
hash collision/batch explosion case.
If hashjoin starts to optimize this, why shouldn't the other places
using work_mem do something similar?
Yes, I suppose other spilling operators (like hashagg) that use
buffered files may consider doing this. But I don't think that is a
reason not to use this particular strategy to "fix" this hash join
batch explosion issue.
You could make the argument that because it is the buffers and not the
actual number of batches that is the problem, that we should fix it by
closing the files that aren't being used while processing a batch.
But I really like how small and isolated your sizing balance patch is.
And I actually think that the fact that it could be used to optimize
this tradeoff (work_mem/file buffers) in other places is good. Anyway,
my point was just that we could do both -- likely in any order.
- Melanie
On 1/11/25 00:09, Melanie Plageman wrote:
On Fri, Jan 10, 2025 at 11:18 AM Tomas Vondra <tomas@vondra.me> wrote:
On 1/10/25 15:54, Melanie Plageman wrote:
On Thu, Jan 9, 2025 at 6:59 PM Tomas Vondra <tomas@vondra.me> wrote:
I think this is because we get the batch based on

*batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) &
(nbatch - 1);

And tuples can only spill forward. I think Robert's example is if we
plan for 64 batches and eventually increase to 256 batches, a tuple
assigned to batch 1 could go to 65, 129, or 193 but no other batch --
meaning we would only need 3 files open when processing batch 1.

Yes, I think that's why we only need 3 more files when splitting a
batch. The way I explain it is that going from 64 -> 256 adds 2 more
bits to the "batchno" part of the batch, and one of the patterns means
"current batch", so 3 new files.

This does remind me of the "one spill file per slice" patch in [1],
although it approaches it from a different angle. My patch defined the
"slice" as batches we can keep in work_mem, while Robert proposed to
decide how many batches we can open (first level of batching), and then
maybe do that recursively if needed. That seems like a fundamentally
more sound approach (indeed, my patch can create too many slices).

[1] /messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development

But I think we would need to do more explicit file flushing and
closing and opening, right? Which maybe doesn't matter when compared
to the overhead of so many more buffers.

Would it be all that much flushing and closing? Yes, we'd need to flush
and release the buffers (which I don't think BufFiles can do right now,
but let's ignore that for now). But I'd hope the batches are fairly
large (because that's why we expect to generate them), so one more write
should not make a lot of difference on top of the actual batch split.
Chances are it's far cheaper than the extra memory pressure due to
keeping all batches in memory ...

I wonder if it might be a problem that those future batches are "far
apart". I mean, we're splitting batch 1, and the tuple may go to batches
+64, +128 and +192 batches ahead. If we allow larger "jumps" (e.g. from
256 to 64k batches), it'd be even more visible.

I don't follow. Why would it be a problem if tuples have to go to
batches that are far away in number?
I think you're right it's not a problem, I was just thinking aloud (or
whatever you do in an e-mail).
Of course, this is a more complex change than the "balancing" patch. But
maybe not that much, not sure. For me the main disadvantage is it
doesn't really help with the batch explosion for skewed data sets (or
data with many hash collisions). It can easily happen we blindly
increase nbatch until we use all the bits, and then break the work_mem
limit anyway.

But maybe there's a way to address that - the growthEnabled=false safety
is an unreliable solution, because it requires the whole batch to fall
to either of the new batches. A single tuple breaks that.

What if we instead compared the two new batches, and instead looked at
how far the split is from 1/2? And if it's very far from 1/2, we'd
either increase work_mem (a bit like the balancing), or disable nbatch
increases (maybe just temporarily).

Meaning like have some threshold for the number of tuples over the
limit we are? Right now, we decide to increase batches when we
encounter that one tuple that puts us over the limit. So, I could see
it making sense to decide with more foresight. Or we could even keep
track of the amount over the limit we are and increase the number of
batches once we hit that threshold.

Not sure I understand. I meant that we disable nbatch growth like this:

if (nfreed == 0 || nfreed == ninmemory)
{
hashtable->growEnabled = false;
}

Ah, right. I was thinking of the wrong thing.

which means that it only takes a single tuple that makes it to the other
batch to keep growing. But if 99.9999% tuples went to one of the
batches, increasing nbatch seems pretty futile.

Right. Yes, that is unfortunate. You could do a percentage threshold.
Or if we knew how big the biggest batch is, we could decide whether or
not to disable growth based on the size the hashtable would be for
that batch vs the overhead of another doubling of nbatches.
I was thinking it might be possible to express this as a formula similar
to the "balancing". I mean, something that says "just double as you
wish" when the current doubling split the batch 50:50, but delays the
next doubling if the batch gets split 99:1 (with some continuous
transition between those two extremes).
Or maybe this could also drive increasing the memory limit. Yes, there's
a chance that the next doubling will split it more evenly, but I think
it's much more likely there really are hash collisions of some sort.
But it goes in the opposite direction too. Imagine a uniform data set
with plenty of distinct values, but correlated / sorted, and each value
having more rows than can fit into a single batch. We'll immediately
disable growth, which is ... not great.

These are somewhat separate / independent issues, but I think having a
concept of "retrying the nbatch growth after a while" would help.

Yes, I think retrying nbatch growth later makes sense in this case. Or
when doubling nbatches wouldn't help split one rogue batch but would
help other big batches.
Exactly. Giving up the growth entirely seems a bit premature. I don't
think there's a principled formula to determine when to retry, but it
might be enough to try after the hash table doubles in size. That's
pretty much the "let's increase work_mem a bit" I mentioned above.
You could do something like this _and_ do something like close the
files that can't be the target of tuples from the current batch --
which would allow you to tolerate many more batch increases before
doubling the hashtable size is worth it. But it seems like the
algorithm to adapt the hashtable size based on the optimal tradeoff
between hashtable size and number of batches could be done first and
the patch to close files could be done later.

Right. I don't think Robert's idea is a complete answer, because it
does not consider the tradeoff that maybe increasing work_mem would be
better. OTOH maybe that's not something the hashjoin should worry about.
The goal is not to optimize the work_mem value, but make sure we don't
use significantly more memory ...

Well it's also not a complete solution because it doesn't solve the
hash collision/batch explosion case.
I think it does, in a way. It doesn't prevent the batch explosion, of
course, it still ends with millions of batches. But it's not keeping all
the BufFiles open at the same time, so it does not use the insane
amounts of memory. And it's slower than the balancing, of course.
If hashjoin starts to optimize this, why shouldn't the other places
using work_mem do something similar?

Yes, I suppose other spilling operators (like hashagg) that use
buffered files may consider doing this. But I don't think that is a
reason not to use this particular strategy to "fix" this hash join
batch explosion issue.

You could make the argument that because it is the buffers and not the
actual number of batches that is the problem, that we should fix it by
closing the files that aren't being used while processing a batch.
But I really like how small and isolated your sizing balance patch is.
And I actually think that the fact that it could be used to optimize
this tradeoff (work_mem/file buffers) in other places is good. Anyway,
my point was just that we could do both -- likely in any order.
Right. The way I'm looking at this is the balancing patch is a strict
improvement over the current state. Robert's proposal is more principled
in that it actually tries to enforce the promised memory limit.
regards
--
Tomas Vondra
On 1/10/25 15:54, Melanie Plageman wrote:
On Thu, Jan 9, 2025 at 6:59 PM Tomas Vondra <tomas@vondra.me> wrote:
...
Robert's idea kept using buffered files, but limited how many we can
fill at any phase. Say we'd use a limit of 1024 batches, but we actually
need 1M batches. Then we'd do the build in two phases - we'd generate
1024 batches, and then we'd split each of those batches into 1024
smaller batches. The trick (as I understand it) is those batches can't
overlap, so we'd not need more than 1024 batches, which greatly limits
the memory consumption. We could even use a lower limit, derived from
work_mem or something like that.

I think this is because we get the batch based on

*batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) &
(nbatch - 1);

And tuples can only spill forward. I think Robert's example is if we
plan for 64 batches and eventually increase to 256 batches, a tuple
assigned to batch 1 could go to 65, 129, or 193 but no other batch --
meaning we would only need 3 files open when processing batch 1. But I
think we would need to do more explicit file flushing and closing and
opening, right? Which maybe doesn't matter when compared to the
overhead of so many more buffers.
I had a quiet evening yesterday, so I decided to take a stab at this and
see how hard would it be, and how bad would the impact be. Attached is
an experimental patch, doing the *bare* minimum for a simple query:
1) It defines a limit of 128 batches (a bit low, but also 1MB). In
practice we'd use something like 256 - 1024, probably. Doesn't matter.
2) Ensures the initial pass over data in MultiExecPrivateHash does not
use more than 128 batches, switches to "tooManyBatches=true" if that
happens (and dumps the batch to file ExecHashDumpBatchToFile, even if
it's batchno=0). And later it calls ExecHashHandleTooManyBatches() to
increase the nbatch further.
3) Does something similar for the outer relation - if there are too many
batches, we do ExecHashJoinRepartitionBatches() which first partitions
into 128 batches. This only does a single pass in the WIP, though.
Should be recursive or something.
4) Extends the BufFile API with BufFileHasBuffer/BufFileFreeBuffer so
that the code can free the buffers. It also means the buffer needs to be
allocated separately, not embedded in BufFile struct. (I'm a bit
surprised it works without having to re-read the buffer after freeing
it, but that's probably thanks to how hashjoin uses the files).
Anyway, this seems to work, and a simple experiment looks like this:
--------------------------------------------------------------------
create table t (a int, b text);
insert into t select i, md5(i::text)
from generate_series(1,100000) s(i);
insert into t select i, md5(i::text)
from generate_series(1,100000) s(i);
insert into t select i, md5(i::text)
from generate_series(1,100000) s(i);
insert into t select i, md5(i::text)
from generate_series(1,100000) s(i);
vacuum analyze;
set work_mem='128kB';
explain analyze select * from t t1 join t t2 on (t1.a = t2.a);
--------------------------------------------------------------------
This is just enough to need 256 batches, i.e. "one doubling" over the
128 batch limit.
On master I get this:
QUERY PLAN
-----------------------------------------------------------------------
Hash Join (cost=15459.00..51555.40 rows=1638740 width=74) (actual
time=80.065..337.192 rows=1600000 loops=1)
Hash Cond: (t1.a = t2.a)
Buffers: shared hit=6668, temp read=5806 written=5806
-> Seq Scan on t t1 (cost=0.00..7334.00 rows=400000 width=37)
(actual time=0.008..19.867 rows=400000 loops=1)
Buffers: shared hit=3334
-> Hash (cost=7334.00..7334.00 rows=400000 width=37) (actual
time=76.824..76.824 rows=400000 loops=1)
Buckets: 4096 Batches: 256 Memory Usage: 132kB
Buffers: shared hit=3334, temp written=2648
-> Seq Scan on t t2 (cost=0.00..7334.00 rows=400000 width=37)
(actual time=0.001..20.855 rows=400000 loops=1)
Buffers: shared hit=3334
Planning Time: 0.100 ms
Execution Time: 385.779 ms
(12 rows)
while with the patch we get this:
QUERY PLAN
-----------------------------------------------------------------------
Hash Join (cost=15459.00..51555.40 rows=1638740 width=74) (actual
time=93.346..325.604 rows=1600000 loops=1)
Hash Cond: (t1.a = t2.a)
Buffers: shared hit=6668, temp read=8606 written=8606
-> Seq Scan on t t1 (cost=0.00..7334.00 rows=400000 width=37)
(actual time=0.006..14.416 rows=400000 loops=1)
Buffers: shared hit=3334
-> Hash (cost=7334.00..7334.00 rows=400000 width=37) (actual
time=48.481..48.482 rows=400000 loops=1)
Buckets: 4096 (originally 4096) Batches: 256 (originally 128)
Memory Usage: 132kB
Buffers: shared hit=3334, temp read=23 written=2860
-> Seq Scan on t t2 (cost=0.00..7334.00 rows=400000 width=37)
(actual time=0.001..14.754 rows=400000 loops=1)
Buffers: shared hit=3334
Planning Time: 0.061 ms
Execution Time: 374.229 ms
(12 rows)
So for this particular query there doesn't seem to be a particularly
massive hit. With more batches that's unfortunately not the case, but
that seems to be mostly due to looping over all batches when freeing
buffers. Looping over 1M batches (which we do for every batch) is
expensive, but that could be improved somehow - we only have a couple
files open (say 1024), so we could keep them in a list or something. And
I think we don't need to free the batches this often anyway, we might
not even have opened any future batches.
I'm sure there's plenty of issues with the patch - e.g. it may not
handle nbatch increases later (after batch 0), I ignored skew buckets,
and stuff like that ...
But it does seem like a workable idea ...
regards
--
Tomas Vondra
Attachments:
hashjoin-file-limit.patch (text/x-patch, +427 -15)
On Sat, Jan 11, 2025 at 7:42 PM Tomas Vondra <tomas@vondra.me> wrote:
I had a quiet evening yesterday, so I decided to take a stab at this and
see how hard would it be, and how bad would the impact be. Attached is
an experimental patch, doing the *bare* minimum for a simple query:

1) It defines a limit of 128 batches (a bit low, but also 1MB). In
practice we'd use something like 256 - 1024, probably. Doesn't matter.

2) Ensures the initial pass over data in MultiExecPrivateHash does not
use more than 128 batches, switches to "tooManyBatches=true" if that
happens (and dumps the batch to file ExecHashDumpBatchToFile, even if
it's batchno=0). And later it calls ExecHashHandleTooManyBatches() to
increase the nbatch further.

3) Does something similar for the outer relation - if there are too many
batches, we do ExecHashJoinRepartitionBatches() which first partitions
into 128 batches. This only does a single pass in the WIP, though.
Should be recursive or something.

4) Extends the BufFile API with BufFileHasBuffer/BufFileFreeBuffer so
that the code can free the buffers. It also means the buffer needs to be
allocated separately, not embedded in BufFile struct. (I'm a bit
surprised it works without having to re-read the buffer after freeing
it, but that's probably thanks to how hashjoin uses the files).
I started looking at this. Even though you do explain what it does
above, I still found it a bit hard to follow. Could you walk through
an example (like the one you gave in SQL) but explaining what happens
in the implementation? Basically what you have in 2 and 3 above but
with a specific example.
This is my understanding of what this does:
if we are at the max number of batches when building the hashtable and
we run out of space and need to double nbatches, we
1. dump the data from the current batch that is in the hashtable into a file
2. close and flush all the currently open buffiles, double the number
of batches, and then only open files for the batches we need to store
tuples from the batch we were trying to put in the hashtable when we
hit the limit (now in a temp file)
I also don't understand why ExecHashJoinRepartitionBatches() is needed
-- I think it has something to do with needing a certain number of
buffers open while processing batch 0, but what does this have to do
with the outer side of the join?
Another random question: why doesn't ExecHashHandleTooManyBatches()
free the outer batch files?
- Melanie
On 1/13/25 17:32, Melanie Plageman wrote:
On Sat, Jan 11, 2025 at 7:42 PM Tomas Vondra <tomas@vondra.me> wrote:
I had a quiet evening yesterday, so I decided to take a stab at this and
see how hard would it be, and how bad would the impact be. Attached is
an experimental patch, doing the *bare* minimum for a simple query:

1) It defines a limit of 128 batches (a bit low, but also 1MB). In
practice we'd use something like 256 - 1024, probably. Doesn't matter.

2) Ensures the initial pass over data in MultiExecPrivateHash does not
use more than 128 batches, switches to "tooManyBatches=true" if that
happens (and dumps the batch to file ExecHashDumpBatchToFile, even if
it's batchno=0). And later it calls ExecHashHandleTooManyBatches() to
increase the nbatch further.

3) Does something similar for the outer relation - if there are too many
batches, we do ExecHashJoinRepartitionBatches() which first partitions
into 128 batches. This only does a single pass in the WIP, though.
Should be recursive or something.

4) Extends the BufFile API with BufFileHasBuffer/BufFileFreeBuffer so
that the code can free the buffers. It also means the buffer needs to be
allocated separately, not embedded in BufFile struct. (I'm a bit
surprised it works without having to re-read the buffer after freeing
it, but that's probably thanks to how hashjoin uses the files).

I started looking at this. Even though you do explain what it does
above, I still found it a bit hard to follow. Could you walk through
an example (like the one you gave in SQL) but explaining what happens
in the implementation? Basically what you have in 2 and 3 above but
with a specific example.
OK, I'll try ... see the end of this message.
This is my understanding of what this does:
if we are at the max number of batches when building the hashtable and
we run out of space and need to double nbatches, we
1. dump the data from the current batch that is in the hashtable into a file
2. close and flush all the currently open buffiles, double the number
of batches, and then only open files for the batches we need to store
tuples from the batch we were trying to put in the hashtable when we
hit the limit (now in a temp file)
Roughly, but the second step needs to happen only after we finish the
first pass over the inner relation. I'll try to explain this as part of
the example.
I also don't understand why ExecHashJoinRepartitionBatches() is needed
-- I think it has something to do with needing a certain number of
buffers open while processing batch 0, but what does this have to do
with the outer side of the join?
No, this is about building batches on the outer side. We've built the
hash table, and we may have ended with a very high nbatch. We can't
build all of them right away (would need too many buffiles), so we do
that in multiple phases, to not cross the limit.
Another random question: why doesn't ExecHashHandleTooManyBatches()
free the outer batch files?
Because it was tailored for the example when all batch splits happen for
batch 0, before we even start processing the outer side. In practice it
probably should free the files.
Let's do the example - as I mentioned, I only tried doing this for the
case where all the batch increases happen for batch 0, before we start
building the outer batches. I'm 99% sure the patch will need to modify a
couple more places to handle batch increases in later stages.
Assume we don't want to use more than 128 batches, but that we're
running a query that needs 256 batches. The patch will do this:
1) ExecHashTableCreate will set nbatch_maximum=128 as the limit for the
current pass over inner relation, and it'll cap the other nbatch fields
accordingly. If we already know we'll need more batches, we set
tooManyBatches=true to remember this.
But let's say we start with nbatch=64, nbatch_maximum=128 (and thus also
with tooManyBatches=false).
2) We start loading data into the hash table, until we exceed the memory
limit for the first time. We double the number to 128, move some of the
data from the hash table to the new batch, and continue.
3) We hit the memory limit again, but this time we've hit
(nbatch == nbatch_maximum)
so we can't double the number of batches. But we also can't continue
adding data to the in-memory hash table, so we set tooManyBatches=true
and we start spilling even the current batch to a file.
4) We finish the first pass over the inner relation with
nbatch = 128
nbatch_maximum = 128
tooManyBatches = true
so we need to do something. We run ExecHashHandleTooManyBatches(), which
starts increasing nbatch until the current batch fits into work_mem. We
have nbatch=128, and the query needs nbatch=256, so we only do one loop.
Note: Right now it simply doubles the number of batches in each loop.
But it could be faster and do up to 128 in one step.
128 -> 16k -> 1M
The later batches will already do all the increases in a single step,
that needs an improvement too.
5) After ExecHashHandleTooManyBatches has completed, we have the inner
side of the join mostly "done". We have nbatch=256.

6) We start building batches on the outer side, but we also don't want
to build all the batches at once - we want to build 128 and only then go
to 256 (or further). This is what ExecHashJoinRepartitionBatches does.
If we have too many batches for one pass, we build 128 batches in the
first pass. And then we just read the batch files, doing further splits.
Right now this just does a single pass and thus splits the relation into
128 batches, and then just continues as before. That's enough for 256
batches, because 256 is a single step past 128.
But it really should be recursive / do multiple passes, to handle more
cases with more than 16k batches (although with higher limit it would be
less of an issue).
7) It also frees the file buffers in various places. Chances are some of
those places are unnecessary, and it should be done in some more places.
As I said, I don't claim this to handle all cases, especially with
splits in later batches.
Does this make it clearer?
regards
--
Tomas Vondra
Hi,
Here's a somewhat cleaned up version of the original patch series (with
memory balancing) from [1].
1) v20250125-0001-Balance-memory-usage-with-hashjoin-batch-e.patch
------------------------------------------------------------------
The 0001 patch does exactly the same thing as
vadjust-size-0001-hashjoin-sizing-balance.patch, except that it moves
the code a bit and (hopefully) does a better job at explaining the logic
in the comments.
I'm fairly happy with how simple and non-invasive this is, and how well
it deals with the issue. Sure, limiting the number of batch files (and
then splitting them recursively later) seems possible and perhaps more
"correct" (in the sense that it better enforces the memory limit). But
it's far more invasive, impacts everyone (not just the rare case of
batch explosion), and doesn't help with "indivisible" batches (we still
end up with the batch explosion).
I don't have capacity/interest to continue working on this (limiting the
number of spill files) in the near term, and even if I had I don't think
it'd be doable for PG18, considering there's just one commitfest.
My plan is to get something like 0001 into PG18. It's strictly better
than what we have now, that's for sure, and I think is good enough for
the rare cases of batch explosion.
The one open question I have is what to do about the hashing, and how we
calculate bucket/batch. With the current scheme we can't increase the
number of buckets above nbuckets_optimal, which is sized for the largest
hash table we can fit into (work_mem * hash_mem_multiplier). But the
patch is based on the idea that at some point it's better to grow the
hash table beyond that limit.

So either we need to change how we split the hash, or just accept that
if the hash table grows too much, we may get longer chains. Which I
guess might be still better than having too many batch files.
The patch uses the lookup table algorithm to reverse bits in the hash
from here:
https://graphics.stanford.edu/~seander/bithacks.html#BitReverseTable
And then it takes nbatch from the reversed value, i.e from the beginning
of the hash (while nbuckets is taken from the end as before). I tried
the other algorithms, but they all seemed slower. Another option would
be to start calculating two separate hashes, or a 64-bit hash (and split
it into two). Not sure.
2) v20250125-0002-Postpone-hashtable-growth-instead-of-disab.patch
------------------------------------------------------------------
0002 is an experimental patch to handle another failure I speculated
about, namely "premature disabling of nbatch growth". The theory was
that if the inner relation is correlated, we may disable nbatch growth
prematurely, because the first nbatch increase happens to not split the
batch at all (because of the correlation).
It turned out to be harder to trigger, because it assumes we actually
start with too few batches - i.e. that we significantly underestimate
the inner relation. I'm sure that can easily happen, but the impact
seems to be less severe than for the batch explosion.
There's a SQL script with an example triggering this in 0003.
The patch simply stops using the "true/false" flag, and instead just
doubles the spaceAllowed threshold (a bit like 0001), effectively
postponing next round of nbatch doubling.
This made me realize we already have the issue with nbuckets sizing - if
we disable nbatch growth (be it forever or just temporarily), we
essentially allow the hash table to exceed the expected size. And thus
the nbuckets may be too low. So we'd already need to increase the number
of buckets, it's not just a matter of the 0001 patch.
3) v20250125-0003-hashjoin-patch-tests.patch
--------------------------------------------
This has some files illustrating the memory usage etc. as explained in
the original message [1]. But it also has three scripts to reproduce the
issues.
- patch/batch-explosion.sql - batch explosion
- patch/disabled-growth.sql - disabled growth / correlated data
- patch/disabled-growth2.sql - disabled growth / hash pattern
To use these scripts, copy the hash-collisions.data to /tmp (the scripts
copy the data into a table). And then \i the script.
Each of the patches has a GUC to enable the behavior
- enable_hashjoin_adjust
- enable_hashjoin_growth
and by default it's "false" i.e. disabled. So if you want to see the new
behavior, you need to explicitly set it to 'true' before the script.
These GUCs are meant only for easier development and would be removed
from the final commit.
regards
[1]: /messages/by-id/9e01b538-fb62-4386-b703-548818911702@vondra.me
--
Tomas Vondra