accounting for memory used for BufFile during hash joins
Hi,
I'm starting this thread mostly to keep track of patches developed in
response to an issue [1] reported on pgsql-performance. The symptoms are
very simple - a query performing a hash join ends up using much more
memory than expected (pretty much ignoring work_mem), and possibly
ending up with OOM.
The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBlock.
This is not ideal even if we happen to estimate everything correctly,
because for example with work_mem=4MB and nbatch=1024, it means we'll
use about 16MB (2*8kB*1024) for the BufFile structures alone, plus the
work_mem for the hash table itself.
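To make that arithmetic easier to play with, here is a tiny standalone
sketch (not PostgreSQL code; the ~8kB per-BufFile figure and the two
files per batch come straight from the description above):

    #include <stdio.h>

    /* rough per-BufFile footprint, dominated by the 8kB PGAlignedBlock buffer */
    #define BUFFILE_KB 8

    int
    main(void)
    {
        long    work_mem_kb = 4 * 1024;     /* work_mem = 4MB, as in the example */
        long    nbatch;

        for (nbatch = 16; nbatch <= 524288; nbatch *= 2)
        {
            /* one inner-side and one outer-side BufFile per batch */
            long    buffile_kb = 2L * BUFFILE_KB * nbatch;

            printf("nbatch = %6ld  BufFile overhead = %8ld kB  (work_mem = %ld kB)\n",
                   nbatch, buffile_kb, work_mem_kb);
        }

        /*
         * nbatch = 1024   -> 16384 kB (16MB) of BufFiles, i.e. 4x work_mem
         * nbatch = 524288 -> ~8GB of BufFiles, the OOM case described below
         */
        return 0;
    }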
But it can easily explode when we under-estimate the hash side. In the
pgsql-performance message, the hash side (with the patches applied,
allowing the query to complete) looks like this:
Hash (cost=2823846.37..2823846.37 rows=34619 width=930)
(actual time=252946.367..252946.367 rows=113478127 loops=1)
So it's 3277x under-estimated. It starts with 16 batches, and ends up
adding more and more batches until it fails with 524288 of them (it gets
to that many batches because some of the values are very common and we
don't disable the growth earlier).
The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.
The two attached patches both account for the BufFile memory, but then
use very different strategies when the work_mem limit is reached.
The first patch realizes it's impossible to keep adding batches without
breaking the work_mem limit, because at some point the BufFiles alone will
need more memory than that. But it does not make sense to stop adding batches
entirely, because then the hash table could grow indefinitely.
So the patch abandons the idea of enforcing work_mem in this situation,
and instead attempts to minimize memory usage over time - it increases
the spaceAllowed in a way that ensures doubling the number of batches
actually reduces memory usage in the long run.
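Very roughly, the reasoning can be illustrated by a standalone sketch like
this (illustration only, not the actual patch code - the names and the
exact growth rule are simplified): doubling nbatch is expected to free
about half of the current hash table, but adds one inner and one outer
BufFile per new batch, so it only pays off while the freed half exceeds the
added BufFile overhead.

    #include <stdbool.h>
    #include <stdio.h>

    #define BUFFILE_BYTES   (8 * 1024)     /* per-BufFile overhead, ~8kB */

    /*
     * Does doubling nbatch still reduce total memory use?  Doubling frees
     * roughly space_allowed / 2 of hash table, but adds 2 * 8kB * nbatch
     * of new BufFile structs (inner + outer side).
     */
    static bool
    doubling_reduces_memory(long space_allowed, long nbatch)
    {
        long    extra_buffiles = 2L * BUFFILE_BYTES * nbatch;

        return space_allowed / 2 > extra_buffiles;
    }

    int
    main(void)
    {
        long    space_allowed = 4L * 1024 * 1024;   /* 4MB */
        long    nbatch;

        for (nbatch = 16; nbatch <= 4096; nbatch *= 2)
        {
            if (!doubling_reduces_memory(space_allowed, nbatch))
                space_allowed *= 2;     /* relax the limit instead of splitting */

            printf("nbatch = %5ld  space_allowed = %6ld kB\n",
                   nbatch, space_allowed / 1024);
        }
        return 0;
    }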
The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
The overflow files can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).
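To illustrate the mapping, a tiny standalone sketch (hypothetical names,
not the patch code): each batch belongs to slice batchno / batches_per_slice,
only the batches of the currently loaded slice get their own BufFile, and
everything else goes through the owning slice's single overflow file.

    #include <stdio.h>

    #define BATCHES_PER_SLICE 128   /* e.g. work_mem = 4MB in the example above */

    static int
    slice_for_batch(int batchno)
    {
        return batchno / BATCHES_PER_SLICE;
    }

    static const char *
    target_file(int batchno, int current_slice)
    {
        return (slice_for_batch(batchno) == current_slice)
            ? "per-batch BufFile"
            : "per-slice overflow BufFile";
    }

    int
    main(void)
    {
        /* with 1024 batches there are 8 slices, but only ~128 batch files open */
        printf("batch 37  -> slice %d, %s\n", slice_for_batch(37), target_file(37, 0));
        printf("batch 900 -> slice %d, %s\n", slice_for_batch(900), target_file(900, 0));
        return 0;
    }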
Neither of those patches tweaks ExecChooseHashTableSize() to consider
memory needed for BufFiles while deciding how many batches will be
needed. That's something that probably needs to happen, but it would not
help with the underestimate issue.
I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).
The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.
It's all just PoC quality, at this point, far from committable state.
[1]: /messages/by-id/bc138e9f-c89e-9147-5395-61d51a757b3b@gusw.net
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).

I want to see if I understand the implications of the per-slice-overflow
patch for execution of hashjoin:
For each bucket in the hashtable, when attempting to double the number of
batches, if the memory that the BufFile structs will occupy once this is done
will exceed the work_mem, split each batch into slices that fit into memory.
This means that, for each probe-side tuple hashing to that bucket, you have to
load every slice of each batch separately into memory to ensure correct
results.
Is this right?
I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).

The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.

So, my initial reaction after taking a look at the patches is that I prefer
the first approach--increasing the resize threshold. The second patch, the
per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
address what is, based on my understanding, an edge case.
--
Melanie Plageman
On Tue, May 7, 2019 at 9:58 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).

I want to see if I understand the implications of the per-slice-overflow patch
for execution of hashjoin:
For each bucket in the hashtable, when attempting to double the number of
batches, if the memory that the BufFile structs will occupy once this is done
will exceed the work_mem, split each batch into slices that fit into memory.
This means that, for each probe-side tuple hashing to that bucket, you have to
load every slice of each batch separately into memory to ensure correct results.
Is this right?
Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice. But I wonder how we'd deal with
outer joins, as Tom Lane asked in another thread:
/messages/by-id/12185.1488932980@sss.pgh.pa.us
I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).

The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.

So, my initial reaction after taking a look at the patches is that I prefer the
first approach--increasing the resize threshold. The second patch, the
per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
address what is, based on my understanding, an edge case.
Personally I'd like to make work_mem more reliable, even if it takes a
major new mechanism.
Stepping back a bit, I think there is something fishy about the way we
detect extreme skew. Is that a factor in this case? Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory. Oops.
I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys fall into the batch containing
the key for the M duplicates before we give up! You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M. In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch. I'm not sure how you
pick the number.
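For illustration only, the relaxed check could look something like this
standalone sketch (the 95% constant and the function are made up; in master
the corresponding test in ExecHashIncreaseNumBatches fires only on an exact
0%/100% split, i.e. nfreed == 0 || nfreed == ninmemory):

    #include <stdbool.h>
    #include <stdio.h>

    #define SKEW_GIVEUP_FRACTION 0.95

    /*
     * nfreed    = tuples that moved to the new child batch during the split
     * ninmemory = tuples the batch held before the split
     */
    static bool
    split_was_useless(long nfreed, long ninmemory)
    {
        long    kept = ninmemory - nfreed;

        /* nearly everything stayed put, or nearly everything moved */
        return kept >= SKEW_GIVEUP_FRACTION * ninmemory ||
               nfreed >= SKEW_GIVEUP_FRACTION * ninmemory;
    }

    int
    main(void)
    {
        printf("%d\n", split_was_useless(0, 100000));      /* 0% moved: give up */
        printf("%d\n", split_was_useless(3000, 100000));   /* 3% moved: give up */
        printf("%d\n", split_was_useless(40000, 100000));  /* 40% moved: keep going */
        return 0;
    }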
Of course that doesn't solve the problem that we don't have a better
plan for dealing with the M duplicates -- it just avoids needless
batch explosions triggered by bad maths. I think we need something
like Tomas's #2, or a way to switch to sort-merge, or some other
scheme. I'm not sure how to compare the slice idea, which involves
processing outer tuples * inner slices with the sort-merge idea, which
involves sorting the inner and outer batch, plus the entirely new
concept of switching to another node at execution time.
I also wondered about reducing the buffer size of the BufFiles, but
that doesn't seem to be fixing the real problem.
--
Thomas Munro
https://enterprisedb.com
On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
On Tue, May 7, 2019 at 9:58 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).

I want to see if I understand the implications of the per-slice-overflow patch
for execution of hashjoin:
For each bucket in the hashtable, when attempting to double the number of
batches, if the memory that the BufFile structs will occupy once this is done
will exceed the work_mem, split each batch into slices that fit into memory.
This means that, for each probe-side tuple hashing to that bucket, you have to
load every slice of each batch separately into memory to ensure correct results.
Is this right?
Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice.
Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.
It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].
[1]: /messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development
But I wonder how we'd deal with outer joins, as Tom Lane asked in
another thread:
That seems unrelated - we slice the array of batches, to keep memory
needed for BufFile under control. The hash table remains intact, so
there's no issue with outer joins.
I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).

The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.

So, my initial reaction after taking a look at the patches is that I prefer the
first approach--increasing the resize threshold. The second patch, the
per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
address what is, based on my understanding, an edge case.

Personally I'd like to make work_mem more reliable, even if it takes a
major new mechanism.
Yeah, I share that attitude.
Stepping back a bit, I think there is something fishy about the way we
detect extreme skew. Is that a factor in this case? Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory. Oops.
Yes, that was a factor in the reported query - the data set contained a
significant number of duplicate values (~10%), but it took a while to
disable growth because there always happened to be a couple of rows with a
different value.
I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys fall into the batch containing
the key for the M duplicates before we give up! You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M. In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch. I'm not sure how you
pick the number.
I agree we should relax the 0%/100% split condition, and disable the
growth sooner. But I think we should also re-evaluate that decision
after a while - the data set may be correlated in some way, in which
case we may disable the growth prematurely. It may not reduce memory
usage now, but it may help in the future.
It's already an issue, but it would be even more likely if we disabled
growth e.g. with just 5%/95% splits.
FWIW I believe this is a mostly orthogonal issue to what's discussed in
this thread.
Of course that doesn't solve the problem that we don't have a better
plan for dealing with the M duplicates -- it just avoids a needless
batch explosions triggered by bad maths. I think we need something
like Tomas's #2, or a way to switch to sort-merge, or some other
scheme. I'm not sure how to compare the slice idea, which involves
processing outer tuples * inner slices with the sort-merge idea, which
involves sorting the inner and outer batch, plus the entirely new
concept of switching to another node at execution time.
Do we actually check how many duplicates are there during planning? I
wonder if we could penalize (or even disable) hashjoins when there are
too many duplicates to fit into work_mem. Of course, that's going to be
tricky with filtering, and so on.
Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.
I also wondered about reducing the buffer size of the BufFiles, but
that doesn't seem to be fixing the real problem.
Yeah. It might help a bit, but it's very limited - even if you reduce
the buffer to say 1kB, it's just a factor of 8. And I'm not sure what
would be the impact on performance.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Do we actually check how many duplicates are there during planning?
Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.
regards, tom lane
On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice.

Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.
Sorry, I read only the description and not the code, and got confused
about that. So, I see three separate but related problems:
A. Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys. We could fix that
with (say) a 95% rule.
B. Lack of good alternative execution strategy when the escape valve
is triggered. A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C. Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.
But I wonder how we'd deal with outer joins, as Tom Lane asked in
another thread:

That seems unrelated - we slice the array of batches, to keep memory
needed for BufFile under control. The hash table remains intact, so
there's no issue with outer joins.
Right, sorry, my confusion. I thought you were describing
https://en.wikipedia.org/wiki/Block_nested_loop. (I actually think we
can make that work for left outer joins without too much fuss by
writing out a stream of match bits to a new temporary file. Googling,
I see that MySQL originally didn't support BNL for outer joins and
then added some match flag propagation thing recently.)
I agree we should relax the 0%/100% split condition, and disable the
growth sooner. But I think we should also re-evaluate that decision
after a while - the data set may be correlated in some way, in which
case we may disable the growth prematurely. It may not reduce memory
usage now, but it may help in the future.

It's already an issue, but it would be even more likely if we disabled
growth e.g. with just 5%/95% splits.

FWIW I believe this is a mostly orthogonal issue to what's discussed in
this thread.
But isn't problem A the root cause of problem C, in most cases? There
must also be "genuine" cases of problem C that would occur even if we
fix that, of course: someone has small work_mem, and data that can be
effectively partitioned to fit it, but it just takes a huge number of
partitions to do it. So that we don't behave badly in those cases, I
agree with you 100%: we should fix the memory accounting to count
BufFile overheads as you are proposing, and then I guess ideally
switch to our alternative strategy (BNL or sort-merge or ...) when we
see that BufFiles are wasting too much work_mem and it's time to try
something else. It seems you don't actually have one of those cases
here, though?
I think we should fix problem A. Then handle problem C by accounting
for BufFiles, and figure out a way to switch to our alternative
strategy (currently: ignore work_mem), when we think that creating
more BufFiles will be futile (not sure exactly what the rule there
should be). And then work on fixing B properly with a good strategy.
Here's a straw-man idea: we could adopt BNL, and then entirely remove
our repartitioning code. If the planner's number of partitions turns
out to be not enough, we'll just handle it using BNL loops.
Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.
The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable. So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.
--
Thomas Munro
https://enterprisedb.com
On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Do we actually check how many duplicates are there during planning?
Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.
I'm looking at the code, and the only place where I see code dealing with
MCVs (probably the best place for info about duplicate values) is
estimate_hash_bucketsize in final_cost_hashjoin. That's not quite what I
had in mind - I was thinking more about something along the lines of "See the
largest group of duplicate values, disable hash join if it can't fit into
work_mem at all."
Of course, if the input estimates are off, that may not work too well. It
would certainly not help the query failing with OOM, because that was a
case of severe underestimate.
Or did you mean some other piece of code that I have missed?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice.

Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

Sorry, I read only the description and not the code, and got confused
about that. So, I see three separate but related problems:

A. Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys. We could fix that
with (say) a 95% rule.
B. Lack of good alternative execution strategy when the escape valve
is triggered. A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C. Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.
Right. I don't think a single solution addressing all those issues exists.
It's more likely we need multiple improvements.
But I wonder how we'd deal with outer joins, as Tom Lane asked in
another thread:

That seems unrelated - we slice the array of batches, to keep memory
needed for BufFile under control. The hash table remains intact, so
there's no issue with outer joins.

Right, sorry, my confusion. I thought you were describing
https://en.wikipedia.org/wiki/Block_nested_loop. (I actually think we
can make that work for left outer joins without too much fuss by
writing out a stream of match bits to a new temporary file. Googling,
I see that MySQL originally didn't support BNL for outer joins and
then added some match flag propagation thing recently.)
Possibly, I'm not against implementing that, although I don't have a very
good idea what the benefits of BNL joins are (performance-wise). In any
case, I think it's entirely unrelated to hash joins.
I agree we should relax the 0%/100% split condition, and disable the
growth sooner. But I think we should also re-evaluate that decision
after a while - the data set may be correlated in some way, in which
case we may disable the growth prematurely. It may not reduce memory
usage now, but it may help in the future.

It's already an issue, but it would be even more likely if we disabled
growth e.g. with just 5%/95% splits.

FWIW I believe this is a mostly orthogonal issue to what's discussed in
this thread.

But isn't problem A the root cause of problem C, in most cases? There
must also be "genuine" cases of problem C that would occur even if we
fix that, of course: someone has small work_mem, and data that can be
effectively partitioned to fit it, but it just takes a huge number of
partitions to do it. So that we don't behave badly in those cases, I
agree with you 100%: we should fix the memory accounting to count
BufFile overheads as you are proposing, and then I guess ideally
switch to our alternative strategy (BNL or sort-merge or ...) when we
see that BufFiles are wasting too much work_mem and it's time to try
something else. It seems you don't actually have one of those cases
here, though?
Maybe. Or maybe not. I don't have enough data to make such judgements
about the causes in general. We have one query from pgsql-performance.
There might be more, but IMO that's probably a biased data set.
But even that reported query is actually not a case of A causing C.
The outer side of the hash join was significantly underestimated (34619
vs. 113478127) due to highly-correlated conditions.
And in that case it's trivial to cause nbatch explosion even with perfect
data sets with no duplicates (so no escape valve failure).
I think we should fix problem A. Then handle problem C by accounting
for BufFiles, and figure out a way to switch to our alternative
strategy (currently: ignore work_mem), when we think that creating
more BufFiles will be futile (not sure exactly what the rule there
should be). And then work on fixing B properly with a good strategy.
Here's a straw-man idea: we could adopt BNL, and then entirely remove
our repartitioning code. If the planner's number of partitions turns
out to be not enough, we'll just handle it using BNL loops.
Yeah, something like that.
I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" a nbatch increase. That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again. Rinse and repeat.
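As a toy sketch of that loop (standalone illustration only - the struct,
the names and the 10% "does it help" threshold are all made up):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct GrowState
    {
        bool    grow_enabled;       /* is nbatch growth currently allowed? */
        long    space_threshold;    /* memory level at which we reconsider */
    } GrowState;

    /*
     * Called when the hash table reaches space_threshold.  The caller is
     * assumed to have walked the current batch and computed what fraction
     * of its tuples would move if nbatch were doubled.
     */
    static void
    reconsider_growth(GrowState *gs, double fraction_that_would_move)
    {
        if (fraction_that_would_move > 0.10)
            gs->grow_enabled = true;        /* splitting helps again: re-enable */
        else
            gs->space_threshold *= 2;       /* still futile: back off, retry later */
    }

    int
    main(void)
    {
        GrowState   gs = {false, 8L * 1024 * 1024};

        reconsider_growth(&gs, 0.02);       /* almost nothing would move */
        printf("enabled=%d threshold=%ld kB\n", gs.grow_enabled, gs.space_threshold / 1024);

        reconsider_growth(&gs, 0.40);       /* data no longer that skewed */
        printf("enabled=%d threshold=%ld kB\n", gs.grow_enabled, gs.space_threshold / 1024);
        return 0;
    }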
For C, I think we can use either of the two approaches I proposed. I like
the second option better, as it actually enforces work_mem. The first
option kinda helped with A too, although in a different way, and I think the
solution I outlined in the previous paragraph will work better.
No opinion regarding the switch to BNL, at the moment.
Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable. So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.
Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have already been sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Do we actually check how many duplicates are there during planning?
Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.
I'm looking at the code, and the only place where I see code dealing with
MCVs (probably the best place for info about duplicate values) is
estimate_hash_bucketsize in final_cost_hashjoin.
What I'm thinking of is this bit in final_cost_hashjoin:
   /*
    * If the bucket holding the inner MCV would exceed work_mem, we don't
    * want to hash unless there is really no other alternative, so apply
    * disable_cost. (The executor normally copes with excessive memory usage
    * by splitting batches, but obviously it cannot separate equal values
    * that way, so it will be unable to drive the batch size below work_mem
    * when this is true.)
    */
   if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
                          inner_path->pathtarget->width) >
       (work_mem * 1024L))
       startup_cost += disable_cost;
It's certainly likely that that logic needs improvement in view of this
discussion --- I was just pushing back on the claim that we weren't
considering the issue at all.
regards, tom lane
On Tue, May 07, 2019 at 10:42:36AM -0400, Tom Lane wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Do we actually check how many duplicates are there during planning?
Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.

I'm looking at the code, and the only place where I see code dealing with
MCVs (probably the best place for info about duplicate values) is
estimate_hash_bucketsize in final_cost_hashjoin.

What I'm thinking of is this bit in final_cost_hashjoin:
   /*
    * If the bucket holding the inner MCV would exceed work_mem, we don't
    * want to hash unless there is really no other alternative, so apply
    * disable_cost. (The executor normally copes with excessive memory usage
    * by splitting batches, but obviously it cannot separate equal values
    * that way, so it will be unable to drive the batch size below work_mem
    * when this is true.)
    */
   if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
                          inner_path->pathtarget->width) >
       (work_mem * 1024L))
       startup_cost += disable_cost;

It's certainly likely that that logic needs improvement in view of this
discussion --- I was just pushing back on the claim that we weren't
considering the issue at all.
Ah, this code is new in 11, and I was looking at code from 10 for some
reason. I don't think we can do much better than this, except perhaps
falling back to (1/ndistinct) when there's no MCV available.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].

[1]: /messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development

Cool, I misunderstood. I looked at the code again today, and at the email
thread where you measured "amplification".
In terms of how many times you write each tuple, is it accurate to say that
a tuple can now be spilled three times (in the worst case) whereas, before,
it could be spilled only twice?

1 - when building the inner side hashtable, tuple is spilled to a "slice" file
2 - (assuming the number of batches was increased) during execution, when a
tuple belonging to a later slice's spill file is found, it is re-spilled to
that slice's spill file
3 - during execution, when reading from its slice file, it is re-spilled
(again) to its batch's spill file

Is it correct that the max number of BufFile structs you will have is equal
to the number of slices + the number of batches in a slice, because that is
the max number of open BufFiles you would have at a time?
By the way, applying v4 patch on master, in an assert build, I am tripping
some asserts -- starting with

    Assert(!file->readOnly);

in BufFileWrite.
One thing I was a little confused by was the nbatch_inmemory member of the
hashtable. The comment in ExecChooseHashTableSize says that it is determining
the number of batches we can fit in memory. I thought that the problem was
the amount of space taken up by the BufFile data structure itself--which is
related to the number of open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about fitting
more than one batch of tuples into memory at a time. I was under the
impression that you could only fit one batch of tuples in memory at a time.
So, I was stepping through the code with work_mem set to the lower bound,
and in ExecHashIncreaseNumBatches, I got confused.
hashtable->nbatch_inmemory was 2 for me, thus nbatch_tmp was 2, so I didn't
meet this condition

    if (nbatch_tmp > hashtable->nbatch_inmemory)

since I just set nbatch_tmp using hashtable->nbatch_inmemory. So, I didn't
increase the number of slices, which is what I was expecting.
What happens when hashtable->nbatch_inmemory is equal to nbatch_tmp?
--
Melanie Plageman
On Tue, May 7, 2019 at 6:59 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable. So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.

Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have already been sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.
What if you switched to NLJ on a batch-by-batch basis and did it before
starting execution of the join but after building the inner side of the hash
table? That way, no tuples will have been sent to other nodes yet.
--
Melanie Plageman
On Tue, May 07, 2019 at 05:43:56PM -0700, Melanie Plageman wrote:
On Tue, May 7, 2019 at 6:59 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:

On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
Switching to some other algorithm during execution moves the goalposts
to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable. So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.

Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have already been sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.

What if you switched to NLJ on a batch-by-batch basis and did it before
starting execution of the join but after building the inner side of the hash
table? That way, no tuples will have been sent to other nodes yet.
Interesting idea! I think you're right doing it on a per-batch basis
would solve that problem. Essentially, if all (or >95%) of the tuples
have the same hash value, we could switch to a special "degraded" mode
doing something like a NL. At that point the hash table benefits are
lost anyway, because all the tuples are in a single chain, so it's not
going to be much slower.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].

[1]: /messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development

Cool, I misunderstood. I looked at the code again today, and at the email
thread where you measured "amplification".
Oh! I hope you're not too disgusted by the code in that PoC patch ;-)
In terms of how many times you write each tuple, is it accurate to
say that a tuple can now be spilled three times (in the worst case)
whereas, before, it could be spilled only twice?

1 - when building the inner side hashtable, tuple is spilled to a "slice"
file
2 - (assuming the number of batches was increased) during execution, when
a tuple belonging to a later slice's spill file is found, it is re-spilled
to that slice's spill file
3 - during execution, when reading from its slice file, it is re-spilled
(again) to its batch's spill file
Yes, that's a mostly accurate understanding. Essentially this might add
one extra step of "reshuffling" from the per-slice to per-batch files.
Is it correct that the max number of BufFile structs you will have
is equal to the number of slices + number of batches in a slice
because that is the max number of open BufFiles you would have at a
time?
Yes. With the caveat that we need twice that number of BufFile structs,
because we need them on both sides of the join.
By the way, applying v4 patch on master, in an assert build, I am tripping
some
asserts -- starting with
Assert(!file->readOnly);
in BufFileWrite
Whoooops :-/
One thing I was a little confused by was the nbatch_inmemory member
of the hashtable. The comment in ExecChooseHashTableSize says that
it is determining the number of batches we can fit in memory. I
thought that the problem was the amount of space taken up by the
BufFile data structure itself--which is related to the number of
open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about
fitting more than one batch of tuples into memory at a time. I was
under the impression that you could only fit one batch of tuples in
memory at a time.
I suppose you mean this chunk:
    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;
Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.
Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.
So, I was stepping through the code with work_mem set to the lower
bound, and in ExecHashIncreaseNumBatches, I got confused.
hashtable->nbatch_inmemory was 2 for me, thus nbatch_tmp was 2, so
I didn't meet this condition

    if (nbatch_tmp > hashtable->nbatch_inmemory)

since I just set nbatch_tmp using hashtable->nbatch_inmemory. So, I
didn't increase the number of slices, which is what I was expecting.
What happens when hashtable->nbatch_inmemory is equal to nbatch_tmp?
Ah, good catch. The condition you're referring to

    if (nbatch_tmp > hashtable->nbatch_inmemory)

should actually be

    if (nbatch > hashtable->nbatch_inmemory)
because the point is to initialize BufFile structs for the overflow
files, and we need to do that once we cross nbatch_inmemory.
And it turns out this actually causes the assert failures in regression
tests, you reported earlier. It failed to initialize the overflow files
in some cases, so the readOnly flag seemed to be set.
Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
On Wed, May 8, 2019 at 8:08 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
One thing I was a little confused by was the nbatch_inmemory member
of the hashtable. The comment in ExecChooseHashTableSize says that
it is determining the number of batches we can fit in memory. I
thought that the problem was the amount of space taken up by the
BufFile data structure itself--which is related to the number of
open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about
fitting more than one batch of tuples into memory at a time. I was
under the impression that you could only fit one batch of tuples in
memory at a time.

I suppose you mean this chunk:
    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;

Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.

Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.
I definitely would prefer to see hashtable->nbatch_inmemory renamed to
hashtable->nbatch_slice--or maybe hashtable->nbuff_inmemory?
I've been poking around the code for awhile today, and, even though I
know that the nbatch_inmemory is referring to the buffiles that can
fit in memory, I keep forgetting and thinking it is referring to the
tuple data that can fit in memory.
It might be worth explicitly calling out somewhere in the comments that
overflow slices will only be created either when the number of batches was
underestimated and ExecHashIncreaseNumBatches pushes the new number of
batches past hashtable->nbatch_inmemory, or when the number of batches
already exceeds hashtable->nbatch_inmemory at hashtable creation time (the
name confuses this for me especially at creation time) -- that is, past the
number of actual buffiles that can be managed in memory.
Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.
So, I ran the following example on master and with your patch.
drop table foo;
drop table bar;
create table foo(a int, b int);
create table bar(c int, d int);
insert into foo select i, i from generate_series(1,10000)i;
insert into bar select 1, 1 from generate_series(1,1000)i;
insert into bar select i%3, i%3 from generate_series(1000,10000)i;
insert into foo select 1,1 from generate_series(1,1000)i;
analyze foo; analyze bar;
set work_mem=64;
On master, explain analyze looked like this
postgres=# explain analyze verbose select * from foo, bar where a = c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=339.50..53256.27 rows=4011001 width=16) (actual
time=28.962..1048.442 rows=4008001 loops=1)
Output: foo.a, foo.b, bar.c, bar.d
Hash Cond: (bar.c = foo.a)
-> Seq Scan on public.bar (cost=0.00..145.01 rows=10001 width=8)
(actual time=0.030..1.777 rows=10001 loops=1)
Output: bar.c, bar.d
-> Hash (cost=159.00..159.00 rows=11000 width=8) (actual
time=12.285..12.285 rows=11000 loops=1)
Output: foo.a, foo.b
Buckets: 2048 (originally 2048) Batches: 64 (originally 16)
Memory Usage: 49kB
-> Seq Scan on public.foo (cost=0.00..159.00 rows=11000 width=8)
(actual time=0.023..3.786 rows=11000 loops=1)
Output: foo.a, foo.b
Planning Time: 0.435 ms
Execution Time: 1206.904 ms
(12 rows)
and with your patch, it looked like this.
postgres=# explain analyze verbose select * from foo, bar where a = c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=339.50..53256.27 rows=4011001 width=16) (actual
time=28.256..1102.026 rows=4008001 loops=1)
Output: foo.a, foo.b, bar.c, bar.d
Hash Cond: (bar.c = foo.a)
-> Seq Scan on public.bar (cost=0.00..145.01 rows=10001 width=8)
(actual time=0.040..1.717 rows=10001 loops=1)
Output: bar.c, bar.d
-> Hash (cost=159.00..159.00 rows=11000 width=8) (actual
time=12.327..12.327 rows=11000 loops=1)
Output: foo.a, foo.b
Buckets: 2048 (originally 2048) Batches: 16384 (originally 16,
in-memory 2) Memory Usage: 131160kB
-> Seq Scan on public.foo (cost=0.00..159.00 rows=11000 width=8)
(actual time=0.029..3.569 rows=11000 loops=1)
Output: foo.a, foo.b
Planning Time: 0.260 ms
Execution Time: 1264.995 ms
(12 rows)
I noticed that the number of batches is much higher with the patch,
and, checking $PGDATA/base/pgsql_tmp, I saw that the number of temp
files (which are the overflow files) at any given time was quite high.
I would imagine that the desired behaviour is to keep memory usage
within work_mem.
In this example, the number of slices is about 8000, each of which
would have an overflow file. Is this the case you mention in the
comment in ExecChooseHashTableSize?
* We ignore (per-slice)
* overflow files, because those serve as "damage control" for cases
* when per-batch BufFiles would exceed work_mem. Given enough batches
* it's impossible to enforce work_mem strictly, because the overflow
* files alone will consume more memory.
--
Melanie Plageman
On Tue, May 21, 2019 at 05:38:50PM -0700, Melanie Plageman wrote:
On Wed, May 8, 2019 at 8:08 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
One thing I was a little confused by was the nbatch_inmemory member
of the hashtable. The comment in ExecChooseHashTableSize says that
it is determining the number of batches we can fit in memory. I
thought that the problem was the amount of space taken up by the
BufFile data structure itself--which is related to the number of
open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about
fitting more than one batch of tuples into memory at a time. I was
under the impression that you could only fit one batch of tuples in
memory at a time.

I suppose you mean this chunk:

    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;

Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.

Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.

I definitely would prefer to see hashtable->nbatch_inmemory renamed to
hashtable->nbatch_slice--or maybe hashtable->nbuff_inmemory?

I've been poking around the code for awhile today, and, even though I
know that the nbatch_inmemory is referring to the buffiles that can
fit in memory, I keep forgetting and thinking it is referring to the
tuple data that can fit in memory.
That's a fair point. I think nbatch_slice is a good name.
It might be worth explicitly calling out somewhere in the comments
that overflow slices will only be created either when the number of
batches was underestimated as part of ExecHashIncreaseNumBatches and
the new number of batches exceeds the value for
hashtable->nbatch_inmemory or when creating the hashtable initially
and the number of batches exceeds the value for
hashtable->nbatch_inmemory (the name confuses this for me at hashtable
creation time especially) -- the number of actual buffiles that can be
managed in memory.
Yes, this definitely needs to be explained somewhere - possibly in a
comment at the beginning of nodeHash.c or something like that.
FWIW I wonder if this "slicing" would be useful even with correct
estimates. E.g. let's say we can fit 128 batches into work_mem, but we
expect to need 256 (and it's accurate). At that point it's probably too
aggressive to disable hash joins - a merge join is likely more expensive
than just using the slicing. But that should be a cost-based decision.
Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.

So, I ran the following example on master and with your patch.
drop table foo;
drop table bar;
create table foo(a int, b int);
create table bar(c int, d int);
insert into foo select i, i from generate_series(1,10000)i;
insert into bar select 1, 1 from generate_series(1,1000)i;
insert into bar select i%3, i%3 from generate_series(1000,10000)i;
insert into foo select 1,1 from generate_series(1,1000)i;
analyze foo; analyze bar;
set work_mem=64;

On master, explain analyze looked like this
postgres=# explain analyze verbose select * from foo, bar where a = c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=339.50..53256.27 rows=4011001 width=16) (actual
time=28.962..1048.442 rows=4008001 loops=1)
Output: foo.a, foo.b, bar.c, bar.d
Hash Cond: (bar.c = foo.a)
-> Seq Scan on public.bar (cost=0.00..145.01 rows=10001 width=8)
(actual time=0.030..1.777 rows=10001 loops=1)
Output: bar.c, bar.d
-> Hash (cost=159.00..159.00 rows=11000 width=8) (actual
time=12.285..12.285 rows=11000 loops=1)
Output: foo.a, foo.b
Buckets: 2048 (originally 2048) Batches: 64 (originally 16)
Memory Usage: 49kB
-> Seq Scan on public.foo (cost=0.00..159.00 rows=11000 width=8)
(actual time=0.023..3.786 rows=11000 loops=1)
Output: foo.a, foo.b
Planning Time: 0.435 ms
Execution Time: 1206.904 ms
(12 rows)

and with your patch, it looked like this.
postgres=# explain analyze verbose select * from foo, bar where a = c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=339.50..53256.27 rows=4011001 width=16) (actual
time=28.256..1102.026 rows=4008001 loops=1)
Output: foo.a, foo.b, bar.c, bar.d
Hash Cond: (bar.c = foo.a)
-> Seq Scan on public.bar (cost=0.00..145.01 rows=10001 width=8)
(actual time=0.040..1.717 rows=10001 loops=1)
Output: bar.c, bar.d
-> Hash (cost=159.00..159.00 rows=11000 width=8) (actual
time=12.327..12.327 rows=11000 loops=1)
Output: foo.a, foo.b
Buckets: 2048 (originally 2048) Batches: 16384 (originally 16,
in-memory 2) Memory Usage: 131160kB
-> Seq Scan on public.foo (cost=0.00..159.00 rows=11000 width=8)
(actual time=0.029..3.569 rows=11000 loops=1)
Output: foo.a, foo.b
Planning Time: 0.260 ms
Execution Time: 1264.995 ms
(12 rows)

I noticed that the number of batches is much higher with the patch,
and, checking $PGDATA/base/pgsql_tmp, I saw that the number of temp
files (which are the overflow files) at any given time was quite high.

I would imagine that the desired behaviour is to keep memory usage
within work_mem.
There's definitely something fishy going on. I suspect it's either because
of the duplicate values (which might fit into 64kB on master, but not when
accounting for BufFile). Or maybe it's because the initial 16 batches
can't possibly fit into work_mem.
If you try with a larger work_mem, say 256kB, does that behave OK?
In this example, the number of slices is about 8000, each of which
would have an overflow file. Is this the case you mention in the
comment in ExecChooseHashTableSize?

* We ignore (per-slice)
* overflow files, because those serve as "damage control" for cases
* when per-batch BufFiles would exceed work_mem. Given enough batches
* it's impossible to enforce work_mem strictly, because the overflow
* files alone will consume more memory.
Yes. 8000 slices is ~64MB, so considering we need them on both sides of
the join that'd be ~128MB. Which is pretty much exactly 131160kB.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBlock.

The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x less overflow files (compared to per-batch).
Hi Tomas
I read your second patch which uses overflow buf files to reduce the total
number of batches.
It would solve the hash join OOM problem that you discussed above: 8kB per
batch leads to the batch bloating problem.
I mentioned in another thread:
/messages/by-id/CAB0yrekv=6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA@mail.gmail.com
There is another hashjoin OOM problem which disables splitting batches too
early. PG uses a flag hashtable->growEnabled to determine whether to split
batches. Once one split fails (all the tuples are assigned to only one of
the two child batches), the growEnabled flag is turned off forever.
This is the opposite side of the batch bloating problem: it ends up with too
few batches and makes the in-memory hash table too large to fit into memory.
Here is the tradeoff: each batch takes more than 8kB (8kB makes sense, due to
performance), the in-memory hash table takes memory as well, and splitting
batches may (but does not have to) reduce the in-memory hash table size while
introducing more batches (and thus more memory usage, 8kB * #batches).
Can we conclude that it is worth splitting only if:
(the reduced memory of the in-memory hash table) - (8kB * number of new
batches) > 0
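In code form the rule is just something like this (a rough standalone
sketch; reduced_hashtable_bytes would have to be estimated, e.g. by walking
the batch, and the names are only for illustration):

    #include <stdbool.h>
    #include <stdio.h>

    #define BUFFILE_BYTES   (8 * 1024)      /* per-batch BufFile overhead, ~8kB */

    /*
     * Splitting is only worth it if the expected reduction of the in-memory
     * hash table outweighs the BufFile overhead of the new batches.
     */
    static bool
    split_is_worth_it(long reduced_hashtable_bytes, long n_new_batches)
    {
        return reduced_hashtable_bytes - BUFFILE_BYTES * n_new_batches > 0;
    }

    int
    main(void)
    {
        /* doubling 64 -> 128 batches adds 64 new batches (512kB of BufFiles) */
        printf("%d\n", split_is_worth_it(2L * 1024 * 1024, 64));    /* frees 2MB: yes */
        printf("%d\n", split_is_worth_it(256L * 1024, 64));         /* frees 256kB: no */
        return 0;
    }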
So I'm considering combining our patch with your patch to fix the join OOM
problem, no matter whether the OOM is caused by (the memory usage of the
in-memory hash table) or by (8kB * number of batches).
nbatch_inmemory in your patch could also be redefined using the rule above.
What's your opinion?
Thanks
Hubert Zhang
Hi Tomas,
Here is the patch; it should be compatible with your patch, and it focuses
on when to grow the number of batches again.
On Tue, May 28, 2019 at 3:40 PM Hubert Zhang <hzhang@pivotal.io> wrote:
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBlock.

The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).

Hi Tomas
I read your second patch which uses overflow buf files to reduce the total
number of batches.
It would solve the hash join OOM problem what you discussed above: 8K per
batch leads to batch bloating problem.

I mentioned in another thread:
/messages/by-id/CAB0yrekv=6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA@mail.gmail.com
There is another hashjoin OOM problem which disables splitting batches too
early. PG uses a flag hashtable->growEnable to determine whether to split
batches. Once one splitting failed(all the tuples are assigned to only one
batch of two split ones) The growEnable flag would be turned off forever.The is an opposite side of batch bloating problem. It only contains too
few batches and makes the in-memory hash table too large to fit into memory.Here is the tradeoff: one batch takes more than 8KB(8KB makes sense, due
to performance), in-memory hash table takes memory as well and splitting
batched may(not must) reduce the in-memory hash table size but introduce
more batches(and thus more memory usage 8KB*#batch).
Can we conclude that it would be worth to splitting if satisfy:
(The reduced memory of in-memory hash table) - (8KB * number of new
batches) > 0So I'm considering to combine our patch with your patch to fix join OOM
problem. No matter the OOM is introduced by (the memory usage of
in-memory hash table) or (8KB * number of batches).nbatch_inmemory in your patch could also use the upper rule to redefine.
What's your opinion?
Thanks
Hubert Zhang
--
Thanks
Hubert Zhang
On Tue, May 28, 2019 at 03:40:01PM +0800, Hubert Zhang wrote:
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Hi Tomas
I read your second patch which uses overflow buf files to reduce the total
number of batches.
It would solve the hash join OOM problem you discussed above: the ~8kB per
batch leads to the batch-bloating problem.

I mentioned in another thread:
/messages/by-id/CAB0yrekv=6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA@mail.gmail.com
There is another hashjoin OOM problem, caused by disabling batch splitting
too early. PG uses the flag hashtable->growEnabled to decide whether to
split batches. Once one split fails (all the tuples end up in only one of
the two new batches), the growEnabled flag is turned off forever.

This is the opposite side of the batch-bloating problem: it leaves too few
batches and makes the in-memory hash table too large to fit into memory.
Yes. There are definitely multiple separate issues in the hashjoin code,
and the various improvements discussed in this (and other) thread usually
address just a subset of them. We need to figure out how to combine them
or maybe devise some more generic solution.
So I think we need to take a step back, and figure out how to combine
these improvements - otherwise we might commit a fix for one issue, making
it much harder/impossible to improve the other issues.
The other important question is whether we see these cases as outliers
(and the solutions as last-resort-attempt-to-survive kind of fix) or more
widely applicable optimizations. I've seen some interesting speedups with
the overflow-batches patch, but my feeling is we should really treat it as
a last-resort to survive.
I had a chat about this with Thomas Munro yesterday. Unfortunately, some
beer was involved but I do vaguely remember he more or less convinced me
the BNL (block nested loop join) might be the right approach here. We
don't have any patch for that yet, though :-(
Here is the tradeoff: each batch takes more than 8kB (the 8kB buffer makes
sense for performance), the in-memory hash table takes memory as well, and
splitting batches may (but does not always) reduce the in-memory hash table
size while introducing more batches (and thus more memory usage, 8kB *
#batches).

Can we conclude that splitting is worthwhile only if:
(the reduction in in-memory hash table size) - (8kB * number of new
batches) > 0
Something like that, yes.
So I'm considering combining our patch with your patch to fix the hash join
OOM problem, whether the OOM is caused by the memory usage of the in-memory
hash table or by the 8kB * #batches overhead.

nbatch_inmemory in your patch could also be redefined using the rule above.
What's your opinion?
One of the issues with my "overflow batches" patch, pointed out to me by
Thomas yesterday, is that it only works with non-parallel hash join. And
we don't know how to make it work in the parallel mode :-(
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Okay, so, while I do have specific, actual code review/commitfest-y
feedback for the patch in this thread registered for this commitfest,
I wanted to defer that for a later email and use this one to cover off
on a few higher level issues.
1) How this patch's approach fits into the wider set of problems with
hybrid hashjoin.
2) Parallel HashJoin implementation of this patch's approach
I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.
I do think that accounting for Buffile overhead when estimating the
size of the hashtable during ExecChooseHashTableSize() so it can be
used during planning is a worthwhile patch by itself (though I know it
is not even part of this patch).
I'll start with 2 since I have less to say there.
From comments upthread, I take it this would not work with parallel
hashjoin as expected. Is this because each worker operates on batches
independently and now batches are lumped into slices?
Thinking through a parallel-aware implementation, it seems like you
would use slice-based barriers for the build phase but batch-based
barriers for the probe phase to avoid getting out of sync (workers
with outer tuples from one batch should not try and join those with
tuples from another batch, even if in the same slice).
You would, of course, need to add code to make slices work with
SharedTuplestore--caveat here is I still haven't tried to understand
how parallel-aware hashjoin works/uses SharedTuplestore.
Now, addressing 1, how this patch fits into the wider set of problems
with current hybrid hashjoin:
Thomas Munro nicely summarized roughly what I'm about to lay out like
this (upthread) -- he called them "three separate but related
problems":
A. Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys. We could fix that
with (say) a 95% rule.
B. Lack of good alternative execution strategy when the escape valve
is triggered. A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C. Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.
However, I would like to lay out the problem space a little bit
differently. (using the end of the alphabet to differentiate).
The following scenarios are how you could end up running out of
memory:
Y. Plan-time underestimation of the number of required batches with
relatively uniform data distribution
In this case, the best join execution strategy is a plain hashjoin
with spilling as needed.
nbatches should be increased as needed, because the data is ~evenly
distributed.
slicing should be employed when the BufFile overhead exceeds some
threshold ratio of work_mem allowed for BufFile overhead
Z. Plan- and/or execution-time underestimation of the number of
required batches with skewed data
If you knew this at planning time, you could have picked another
join-type, though, there might be cases where it would actually be
less costly to use plain hashjoin for all batches except the bad batch
and fall back to hash block nested loop join just for the duplicate
values.
If you could not have known this at planning time, the best join
execution strategy is a hybrid hashjoin/hash block nested loop join.
To do this, preview whether increasing nbatches would move tuples, and, if
it would, increase nbatches (also employing slicing if the BufFile overhead
exceeds the threshold).
If increasing nbatches wouldn't move tuples, process this batch with
hash block nested loop join.
Essentially, what we want is logical units of tuples which are
work_mem-sized. In some cases, each unit may contain multiple batches
(a slice in Tomas' patch) and in other cases, each unit may contain
only part of a batch (a chunk is the term I used in my hash block
nested loop join patch [1]).
For slicing, each unit, a slice, has multiple batches but one spill
file.
For hbnlj, each unit, a chunk, is one of multiple chunks in a single
batch, all of which are in the same spill file (1 batch = 1 spill
file).
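To make the two kinds of unit concrete, here is an illustrative-only pair
of structures; neither matches the actual layout in either patch, and a
plain FILE pointer stands in for BufFile:

```c
#include <stdio.h>

/* slicing: several consecutive batches share one spill file per side */
typedef struct Slice
{
    int     firstbatch;     /* first batch number in this slice */
    int     nbatches;       /* how many batches are lumped together */
    FILE   *innerfile;      /* single overflow file for the inner side */
    FILE   *outerfile;      /* single overflow file for the outer side */
} Slice;

/* hash block nested loop join: one batch is processed in chunks */
typedef struct BatchChunks
{
    int     batchno;        /* the (oversized) batch being processed */
    int     nchunks;        /* number of work_mem-sized pieces */
    FILE   *spillfile;      /* 1 batch = 1 spill file, read chunk by chunk */
} BatchChunks;
```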
Thinking through it, it seems to make the most sense to split the work
into ~ 3 independent pieces:
patch1 - "preview" a batch increase (not yet written [I think])
patch2 - slicing (Tomas' patch [2] but add in a threshold for the portion of
work_mem the BufFile overhead is using)
patch3 - hash block nested loop join (my patch [1])
patch1 allows us to re-enable growth and was mentioned upthread, but I
will quote it here for simplicity:
I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" a nbatch increase. That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again. Rinse and repeat.
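A hypothetical sketch of such a "preview" pass, with the hash values passed
in as a plain array for simplicity (the real code would walk the hash table
chunks; all names here are illustrative, not from either patch):

```c
#include <stdbool.h>
#include <stdint.h>

static bool
nbatch_increase_would_help(const uint32_t *hashvalues, int ntuples,
                           int curbatch, int log2_nbuckets, int nbatch)
{
    int     moved = 0;
    int     new_nbatch = nbatch * 2;

    for (int i = 0; i < ntuples; i++)
    {
        /* same batch assignment rule as ExecHashGetBucketAndBatch */
        int     batchno = (hashvalues[i] >> log2_nbuckets) & (new_nbatch - 1);

        if (batchno != curbatch)
            moved++;
    }

    /* e.g. require at least 5% of tuples to move (cf. the "95% rule") */
    return moved * 20 >= ntuples;
}
```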
We don't want to fill up work_mem with buffile overhead after
increasing nbatches many times just to move a few tuples for one batch
and end up disabling growth thus making it so that later we can't
increase nbatches and repartition for a batch that would nicely
partition (like Hubert's case, I believe [3]).
We want to identify when re-partitioning would help and only do it
then and, for times when it wouldn't help, use a fallback strategy
that still allows progress on the hashjoin, and, for some spiky data,
where we have re-partitioned for the right reasons, but there are
still a lot of batches that are small enough that they could all fit
in memory at once, we want to track them with as little overhead as
possible -- lump them into slices.
We should probably consider deciding to use slices based on some
threshold for the portion of work_mem which is allowed to be occupied
by buffile overhead instead of waiting until the buffile overhead is
literally taking up most of work_mem.
The above summary is to address the concern in this thread about a
holistic solution.
I think the slicing patch is independent of both the hash block nested
loop join patch and the "preview" mode for batch increasing.
If slicing is made to work for parallel-aware hashjoin and the code is
in a committable state (and probably has the threshold I mentioned
above), then I think that this patch should go in.
[1]: /messages/by-id/CAAKRu_ZkkukQgXCK8ADe-PmvcmpZh6G1Uin8pqqovL4x7P30mQ@mail.gmail.com
[2]: /messages/by-id/20190508150844.rij36rtuk4lhvztw@development
[3]: /messages/by-id/CAB0yre=e8ysPyoUvZqjKYAxc6-VB=JKHL-7XKZSxy0FT5vY7BQ@mail.gmail.com
--
Melanie Plageman
On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
Okay, so, while I do have specific, actual code review/commitfest-y
feedback for the patch in this thread registered for this commitfest,
I wanted to defer that for a later email and use this one to cover off
on a few higher level issues.1) How this patch's approach fits into the wider set of problems with
hybrid hashjoin.2) Parallel HashJoin implementation of this patch's approach
I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.
OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe add
BNL on top.
I do think that accounting for Buffile overhead when estimating the
size of the hashtable during ExecChooseHashTableSize() so it can be
used during planning is a worthwhile patch by itself (though I know it
is not even part of this patch).
+1 to that
I'll start with 2 since I have less to say there.
From comments upthread, I take it this would not work with parallel
hashjoin as expected. Is this because each worker operates on batches
independently and now batches are lumped into slices?

Thinking through a parallel-aware implementation, it seems like you
would use slice-based barriers for the build phase but batch-based
barriers for the probe phase to avoid getting out of sync (workers
with outer tuples from one batch should not try and join those with
tuples from another batch, even if in the same slice).

You would, of course, need to add code to make slices work with
SharedTuplestore--caveat here is I still haven't tried to understand
how parallel-aware hashjoin works/uses SharedTuplestore.
I don't know. I haven't thought about the parallel version very much. I
wonder if Thomas Munro has some thoughts about it ...
Now, addressing 1, how this patch fits into the wider set of problems
with current hybrid hashjoin:

Thomas Munro nicely summarized roughly what I'm about to lay out like
this (upthread) -- he called them "three separate but related
problems":

A. Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys. We could fix that
with (say) a 95% rule.
B. Lack of good alternative execution strategy when the escape valve
is triggered. A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C. Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.

However, I would like to lay out the problem space a little bit
differently (using the end of the alphabet to differentiate).

The following scenarios are how you could end up running out of
memory:

Y. Plan-time underestimation of the number of required batches with
relatively uniform data distribution

In this case, the best join execution strategy is a plain hashjoin
with spilling as needed.
nbatches should be increased as needed, because the data is ~evenly
distributed.
slicing should be employed when the BufFile overhead exceeds some
threshold ratio of work_mem allowed for BufFile overhead
OK, makes sense. But at some point we get so many slices the overflow
files alone use more than work_mem. Of course, to hit that the
underestimate needs to be sufficiently serious. My understanding was we'll
roll until that point and then switch to BNL.
Z. Plan- and/or execution-time underestimation of the number of
required batches with skewed data

If you knew this at planning time, you could have picked another
join-type, though, there might be cases where it would actually be
less costly to use plain hashjoin for all batches except the bad batch
and fall back to hash block nested loop join just for the duplicate
values.

If you could not have known this at planning time, the best join
execution strategy is a hybrid hashjoin/hash block nested loop join.

To do this, preview whether increasing nbatches would move tuples, and, if
it would, increase nbatches (also employing slicing if the BufFile overhead
exceeds the threshold).

If increasing nbatches wouldn't move tuples, process this batch with
hash block nested loop join.
OK.
Essentially, what we want is logical units of tuples which are
work_mem-sized. In some cases, each unit may contain multiple batches
(a slice in Tomas' patch) and in other cases, each unit may contain
only part of a batch (a chunk is the term I used in my hash block
nested loop join patch [1]).
OK, although with slicing the work_mem-sized unit is still one batch. The
slice just ensures the metadata we need to keep in memory does not grow as
O(N) with the number of batches (instead it's O(log(N)) I think).
For slicing, each unit, a slice, has multiple batches but one spill
file.
For hbnlj, each unit, a chunk, is one of multiple chunks in a single
batch, all of which are in the same spill file (1 batch = 1 spill
file).
Yep.
Thinking through it, it seems to make the most sense to split the work
into ~ 3 independent pieces:

patch1 - "preview" a batch increase (not yet written [I think])
patch2 - slicing (Tomas' patch [2] but add in a threshold for the portion of
work_mem the BufFile overhead is using)
patch3 - hash block nested loop join (my patch [1])

patch1 allows us to re-enable growth and was mentioned upthread, but I
will quote it here for simplicity:

I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" a nbatch increase. That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again. Rinse and repeat.

We don't want to fill up work_mem with buffile overhead after
increasing nbatches many times just to move a few tuples for one batch
and end up disabling growth thus making it so that later we can't
increase nbatches and repartition for a batch that would nicely
partition (like Hubert's case, I believe [3]).
Yes, this seems like a very reasonable plan. Also, I now see it actually
explains what the plan with BNL vs. slicing is.
We want to identify when re-partitioning would help and only do it
then and, for times when it wouldn't help, use a fallback strategy
that still allows progress on the hashjoin, and, for some spiky data,
where we have re-partitioned for the right reasons, but there are
still a lot of batches that are small enough that they could all fit
in memory at once, we want to track them with as little overhead as
possible -- lump them into slices.

We should probably consider deciding to use slices based on some
threshold for the portion of work_mem which is allowed to be occupied
by buffile overhead instead of waiting until the buffile overhead is
literally taking up most of work_mem.
But that heuristics is already there, no? That's the "Don't use more than
2*work_mem/3 for batch BufFiles" at which point we start adding slices.
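A minimal sketch of that heuristic, assuming ~8kB per BufFile and two files
(inner + outer) per batch; the constant and names are illustrative rather
than copied from the patch:

```c
#include <stdbool.h>
#include <stddef.h>

#define BUFFILE_OVERHEAD 8192       /* ~8kB buffer inside each BufFile */

static bool
should_start_slicing(int nbatch, size_t work_mem_bytes)
{
    size_t  overhead = (size_t) nbatch * 2 * BUFFILE_OVERHEAD;

    /* don't let the batch BufFiles alone use more than 2/3 of work_mem */
    return overhead > work_mem_bytes * 2 / 3;
}
```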
The above summary is to address the concern in this thread about a
holistic solution.

I think the slicing patch is independent of both the hash block nested
loop join patch and the "preview" mode for batch increasing.

If slicing is made to work for parallel-aware hashjoin and the code is
in a committable state (and probably has the threshold I mentioned
above), then I think that this patch should go in.
Yes, I think this seems like a good plan.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, May 6, 2019 at 9:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Stepping back a bit, I think there is something fishy about the way we
detect extreme skew. Is that a factor in this case? Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory. Oops.

I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys fall into the batch containing
the key for the M duplicates before we give up! You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M. In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch. I'm not sure how you
pick the number.
Another thing that is fishy about this is that we can't split a batch
or a bucket without splitting them all. Let's say that nbatches *
nbuckets = 16 million. One bucket in one batch contains 90% of the
tuples. Splitting *that* bucket might be a good idea if only 5% of the
tuples end up moving, perhaps even if only 1% end up moving. But, if
you have to double the total number of batches to get that benefit,
it's a lot less compelling, because now you have to rescan the outer
side more times.
I wonder whether we should be dividing things into batches unevenly,
based on the distribution of tuples we've seen so far. For example,
suppose we've gotten to 1024 buckets and that's all we can fit in
memory. If we decide to go to 2 batches, we'll use the next bit of the
hash key to decide which things go into batch 0 and which things go
into batch 1. But if we know that 50% of the data is in bucket 17, why
are we not making bucket 17 into a batch and everything else into
another batch? Then, when we process the batch that was derived from
bucket-17, we can use 10 completely new bits from the hash key to
slice the data from that bucket as finely as possible.
Now the bucket might be entirely duplicates, in which case no number
of additional bits will help. However, even in that case it's still a
good idea to make it its own batch, and then use some other algorithm
to process that batch. And if it's *not* entirely duplicates, but
there are say 2 or 3 really common values that unluckily hash to the
same bucket, then being able to use a lot more bits for that portion
of the data gives us the best chance of managing to spread it out into
different buckets.
Similarly, if we split the hash join into four batches, and batch 0
fits in memory but batch 1 does not, we cannot further split batch 1
without splitting batch 2 and batch 3 also. That's not good either,
because those batches might be small and not need splitting.
I guess what I'm trying to say is that our algorithms for dealing with
mis-estimation seem to be largely oblivious to the problem of skew,
and I don't think the problem is confined to extreme skew. Suppose you
have some data that is only moderately skewed, so that when you build
a hash table with 1024 buckets, 25% of the data is in buckets 0-19,
25% in buckets 20-768, 25% in buckets 769-946, and the last 25% in
buckets 947-1023. If you knew that, then when you discover that the
data is 4x too large to fit in memory, you can divide the data into 4
batches using those bucket number ranges, and get it done in exactly 4
batches. As it is, you'll need to split until every uniform range of
buckets fits in memory: 0-31 is going to be too big a range, so you're
going to go with 0-15, which means you'll have 64 batches instead of
4.
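A rough sketch of that kind of grouping, purely to illustrate the idea (not
from any posted patch): given per-bucket byte counts observed while building
the hash table, greedily group contiguous bucket ranges so that each group
fits under the memory limit.

```c
#include <stddef.h>

static int
assign_buckets_to_batches(const size_t *bucket_bytes, int nbuckets,
                          size_t limit, int *batch_of_bucket)
{
    int     batchno = 0;
    size_t  used = 0;

    for (int b = 0; b < nbuckets; b++)
    {
        /* start a new batch once adding this bucket would exceed the limit */
        if (used > 0 && used + bucket_bytes[b] > limit)
        {
            batchno++;
            used = 0;
        }
        used += bucket_bytes[b];
        batch_of_bucket[b] = batchno;
    }

    return batchno + 1;     /* number of batches produced */
}
```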
It seems to me that a good chunk of what's being proposed right now
basically ignores the fact that we're not really responding to the
skew in a very effective way. Thomas wants to stop splitting all the
buckets when splitting one of the buckets produces only a very small
benefit rather than when it produces no benefit at all, but he's not
asking why we're splitting all of the buckets in the first place.
Tomas wants to slice the array of batches because there are so many of
them, but why are there so many? As he said himself, "it gets to that
many batches because some of the values are very common and we don't
disable the growth earlier." Realistically, I don't see how there can
be so many batches that we can't even fit the metadata about the
matches into memory unless we're unnecessarily creating a lot of
little tiny batches that we don't really need.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jul 12, 2019 at 1:16 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, May 6, 2019 at 9:49 PM Thomas Munro <thomas.munro@gmail.com>
wrote:

Stepping back a bit, I think there is something fishy about the way we
detect extreme skew. Is that a factor in this case? Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory. Oops.

I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys fall into the batch containing
the key for the M duplicates before we give up! You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M. In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch. I'm not sure how you
pick the number.

Another thing that is fishy about this is that we can't split a batch
or a bucket without splitting them all. Let's say that nbatches *
nbuckets = 16 million. One bucket in one batch contains 90% of the
tuples. Splitting *that* bucket might be a good idea if only 5% of the
tuples end up moving, perhaps even if only 1% end up moving. But, if
you have to double the total number of batches to get that benefit,
it's a lot less compelling, because now you have to rescan the outer
side more times.
It seems to me that a good chunk of what's being proposed right now
basically ignores the fact that we're not really responding to the
skew in a very effective way. Thomas wants to stop splitting all the
buckets when splitting one of the buckets produces only a very small
benefit rather than when it produces no benefit at all, but he's not
asking why we're splitting all of the buckets in the first place.
Tomas wants to slice the array of batches because there are so many of
them, but why are there so many? As he said himself, "it gets to that
many batches because some of the values are very common and we don't
disable the growth earlier." Realistically, I don't see how there can
be so many batches that we can't even fit the metadata about the
matches into memory unless we're unnecessarily creating a lot of
little tiny batches that we don't really need.
+1 on Robert's suggestion. It's worth finding the root cause of the batch
explosion problem.
As Robert pointed out, "we can't split a batch without splitting them all".
In fact, the hybrid hash join algorithm should only split the overflowing
batch and avoid splitting the small batches which could be processed in
memory. The planner should calculate an initial batch number which ensures
the average-size batch can be processed in memory given the data
distribution. The executor should split skewed batches in a
one-batch-at-a-time manner.
I will first show an example to help understand the batch explosion problem.
Suppose we are going to join R and S, and the planner calculates the initial
nbatch as 4.
In the first batch run, during the HJ_BUILD_HASHTABLE state we scan R, build
an in-memory hash table for batch1, and spill the other tuples of R into
separate batch files (R2-R4). During the HJ_NEED_NEW_OUTER and
HJ_SCAN_BUCKET states, we do two things: 1. if a tuple of S belongs to the
current batch, match it against the in-memory R1 and emit the result to the
parent plan node; 2. if a tuple of S doesn't belong to the current batch,
spill it to one of the batch files S2-S4. As a result, after the first batch
run we get:
6 disk files: batch2(R2,S2), batch3(R3,S3), batch4(R4,S4)
Now we run into the HJ_NEED_NEW_BATCH state and begin to process R2 and S2.
Suppose the second batch R2 is skewed and needs the batch number increased
to 8. While building the in-memory hash table for R2, we also split some
tuples of R2 into spill file R6 (based on our hash function, tuples
belonging to R2 cannot be shuffled to any batch other than R6). After R2's
hash table is built, we begin to probe the tuples in S2. Since the batch
number has changed from 4 to 8, some tuples of S2 now belong to S6 and we
spill them to disk file S6. The other tuples of S2 are matched against R2
and the results are emitted to the parent plan node. After the second batch
is processed, we have:
disk files: batch3(R3,S3), batch4(R4,S4), batch6(R6,S6)
Next, we begin to process R3 and S3. The third batch R3 is not skewed, but
our hash function depends on the batch number, which is now 8. So we have
to split some tuples of R3 into disk file R7, *which is not necessary*. When
probing S3, we also need to split some tuples of S3 into S7, *which is not
necessary either*. Since R3 could be loaded into memory entirely, spilling
part of R3 to a disk file not only introduces more files and file buffers
(which is the problem Tomas is trying to solve), but also slows down the
performance. After the third batch is processed, we have:
disk files: batch4(R4,S4), batch6(R6,S6), batch7(R7,S7)
Next, we begin to process R4 and S4. Similar to R3, some tuples of R4 also
need to be spilled to file R8. But after this splitting, suppose R4 is
still skewed, and we increase the batch number again to 16. As a result,
some tuples of R4 will be spilled to files R12 and R16. When probing S4,
we similarly need to split some tuples of S4 into S8, S12 and S16. After
this step, we have:
disk files:
batch6(R6,S6), batch7(R7,S7), batch8(R8,S8), batch12(R12,S12), batch16(R16,S16)
Next, when we begin to process R6 and S6, even though we could build the
hash table for R6 entirely in memory, we still have to split R6 based on the
new batch number 16 and spill tuples to file R14. *It's not necessary.*
Now we can conclude that increasing the batch number introduces unnecessary
repeated spilling, not only for original batches (R3,S3) but also for newly
generated batches (R6,S6), in a cascading way. *In a worse case, suppose R2
is extremely skewed and needs to be split 10 times, while R3 could build its
hash table entirely in memory. In this case, we have to introduce
R7,R11,....,R4095, in total 1023 unnecessary spill files. Each of these
files may contain fewer than ten tuples. Also, we need to palloc a file
buffer (512KB) for these spill files. This is the so-called batch explosion
problem.*
*Solutions:*
To avoid this unnecessary repeated spilling, I propose to turn the function
ExecHashGetBucketAndBatch
into a hash function chain to determine the batchno.
Here is the original implementation of ExecHashGetBucketAndBatch
```
//nbatch is the global batch number
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
```
We can see that the original hash function basically calculates the hash
value MOD the global batch number.
A real hybrid hash join should use a hash function chain to determine the
batchno. In the new algorithm, the components of the hash function chain are
defined as: MOD of #IBN (the initial batch number), MOD of #IBN*2, MOD of
#IBN*4, MOD of #IBN*8, etc. A small batch just uses the first hash function
in the chain, while a skewed batch uses as many hash functions in the chain
as the number of times it has been split.
Here is the new implementation of ExecHashGetBucketAndBatch
```
/* i is the current batchno we are processing */
/* hashChainLen[i] records how many times batch i has been split */
for (j = 0; j < hashChainLen[i]; j++)
{
	/* initial_nbatch << j is the initial batch number (#IBN) times 2^j */
	batchno = (hashvalue >> hashtable->log2_nbuckets) &
			  ((initial_nbatch << j) - 1);

	/*
	 * If the calculated batchno is still i, we need to try the next hash
	 * function in the chain to determine the final batchno; otherwise we
	 * can return it directly.
	 */
	if (batchno != i)
		return batchno;
}
return batchno;
```
A quick example. Suppose R3's input is 3,7,11,15,19,23,27,31,35,15,27 (we
can ensure they are all MOD 4 = 3).
Suppose the initial batch number is 4 and memory can hold 4 tuples, so the
5th tuple forces a batch split.
Step 1: batch3 processes 3,7,11,15,19 and now needs to split;
chainLen[3]=2
batch3: 3,11,19
batch7: 7,15
Step2: 23,27,31 coming
batch3: 3,11,19,27
batch7: 7,15,23,31
Step 3: 35 coming, batch3 needs to split again
chainLen[3]=3
batch3: 3,19,35
batch7: 7,15,23,31
batch11: 11,27
Step 4: 15 coming, HashFun1 15%4=3, HashFun2 15%8=7;
since 7!=3 spill 15 to batch7.
Step 5: 27 coming, 27%4=3, 27%8=3, 27%16=11;
since 11!=3 spill 27 to batch 11.
Final state:
chainLen[3]=3
batch3: 3,19,35
batch7: 7,15,23,31,15
batch11: 11,27,27
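A self-contained sketch of the chain lookup, reproducing steps 4 and 5 above
(names are illustrative, and the real function would also shift by
log2_nbuckets before masking):

```c
#include <stdio.h>

static int
chain_batchno(unsigned hashvalue, int initial_nbatch, int chainlen, int owner)
{
    int     batchno = owner;

    for (int j = 0; j < chainlen; j++)
    {
        /* MOD of initial_nbatch * 2^j, done as a mask (powers of two) */
        batchno = hashvalue & (initial_nbatch * (1 << j) - 1);
        if (batchno != owner)
            return batchno;
    }
    return batchno;
}

int
main(void)
{
    /* chainLen[3] == 3 after the two splits in the example */
    printf("15 -> batch %d\n", chain_batchno(15, 4, 3, 3));     /* prints 7 */
    printf("27 -> batch %d\n", chain_batchno(27, 4, 3, 3));     /* prints 11 */
    return 0;
}
```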
Here is pseudocode for processing batch i:
```
/* Step 1: build the hash table for batch i from file Ri */
for each tuple in ReadFromFile(Ri)
{
	/* get the batchno using the new hash-function-chain lookup */
	batchno = NewExecHashGetBucketAndBatch();

	/* spill the tuple if it no longer belongs to the current batch */
	if (batchno != i)
		spill tuple to file[batchno];
	else
	{
		flag = InsertTupleToHashTable(HT, tuple);
		if (flag == NEED_SPLIT)
		{
			hashChainLen[i]++;
			/* then call ExecHashIncreaseNumBatches() to do the real spill */
		}
	}
}

/* Step 2: probe stage, reading the outer files that map to batch i */
for each tuple in ReadFromFile(S[i + Bi*k])
{
	batchno = NewExecHashGetBucketAndBatch();
	if (batchno == curbatch)
		probe and match;
	else
		spillToFile(tuple, batchno);
}
```
This solution only splits the batches which need to be split, in a lazy way.
If this solution makes sense, I would like to write the real patch.
Any comments?
--
Thanks
Hubert Zhang
On 2019-Jul-11, Tomas Vondra wrote:
On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.

OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe add
BNL on top.
So what's a good way forward for this patch? Stalling forever like a
glacier is not an option; it'll probably end up melting. There's a lot
of discussion on this thread which I haven't read, and it's not
immediately clear to me whether this patch should just be thrown away in
favor of something completely different, or it can be considered a first
step in a long road.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 3, 2019 at 9:36 AM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
On 2019-Jul-11, Tomas Vondra wrote:
On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.

OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe add
BNL on top.

So what's a good way forward for this patch? Stalling forever like a
glacier is not an option; it'll probably end up melting. There's a lot
of discussion on this thread which I haven't read, and it's not
immediately clear to me whether this patch should just be thrown away in
favor of something completely different, or it can be considered a first
step in a long road.
So, I have been working on the fallback to block nested loop join
patch--latest non-parallel version posted here [1]. I am currently
still working on the parallel version but don't have a complete
working patch yet. I am hoping to finish it and solicit feedback in
the next couple weeks.
My patch chunks up a bad inner side batch and processes it a chunk
at a time. I haven't spent too much time yet thinking about Hubert's
suggestion proposed upthread. In the past I had asked Tomas about the
idea of splitting up only "bad batches" to avoid having other batches
which are very small. It seemed like this introduced additional
complexity for future spilled tuples finding a home, however, I had
not considered the hash function chain method Hubert is mentioning.
Even if we implemented additional strategies like the one Hubert is
suggesting, I still think that both the slicing patch originally
proposed in this thread as well as a BNLJ fallback option could all
work together, as I believe they solve slightly different problems.
If Tomas or someone else has time to pick up and modify the BufFile
accounting patch, committing that still seems like the next logical
step.
I will work on getting a complete (parallel-aware) BNLJ patch posted
soon.
[1]: /messages/by-id/CAAKRu_ZsRU+nszShs3AGVorx=e+2jYkL7X=jiNO6+qbho7vRpw@mail.gmail.com
--
Melanie Plageman
On Thu, Sep 05, 2019 at 09:54:33AM -0700, Melanie Plageman wrote:
On Tue, Sep 3, 2019 at 9:36 AM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:On 2019-Jul-11, Tomas Vondra wrote:
On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.

OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe add
BNL on top.

So what's a good way forward for this patch? Stalling forever like a
glacier is not an option; it'll probably end up melting. There's a lot
of discussion on this thread which I haven't read, and it's not
immediately clear to me whether this patch should just be thrown away in
favor of something completely different, or it can be considered a first
step in a long road.

So, I have been working on the fallback to block nested loop join
patch--latest non-parallel version posted here [1]. I am currently
still working on the parallel version but don't have a complete
working patch yet. I am hoping to finish it and solicit feedback in
the next couple weeks.

My patch chunks up a bad inner side batch and processes it a chunk
at a time. I haven't spent too much time yet thinking about Hubert's
suggestion proposed upthread. In the past I had asked Tomas about the
idea of splitting up only "bad batches" to avoid having other batches
which are very small. It seemed like this introduced additional
complexity for future spilled tuples finding a home, however, I had
not considered the hash function chain method Hubert is mentioning.

Even if we implemented additional strategies like the one Hubert is
suggesting, I still think that both the slicing patch originally
proposed in this thread as well as a BNLJ fallback option could all
work together, as I believe they solve slightly different problems.
I have to admit I kinda lost track of how exactly all the HJ patches
posted in various -hackers threads shall work together in the end. We have
far too many in-flight patches dealing with this part of the code at the
moment. It's a bit like with the buses - for years there were no patches
fixing those issues, and now we have 17 ;-)
My feeling is that we should get the BNLJ committed first, and then maybe
use some of those additional strategies as fallbacks (depending on which
issues are still unsolved by the BNLJ).
If Tomas or someone else has time to pick up and modify the BufFile
accounting patch, committing that still seems like the next logical
step.
OK, I'll look into that (i.e. considering BufFile memory during planning,
and disabling HJ if not possible).
I will work on getting a complete (parallel-aware) BNLJ patch posted
soon.
Good!
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote:
My feeling is that we should get the BNLJ committed first, and then maybe
use some of those additional strategies as fallbacks (depending on which
issues are still unsolved by the BNLJ).
The glacier is melting more. Tomas, what's your status here? The
patch has been waiting on author for two months now. If you are not
planning to work more on this one, then it should be marked as
returned with feedback?
--
Michael
On Mon, Nov 25, 2019 at 05:33:35PM +0900, Michael Paquier wrote:
On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote:
My feeling is that we should get the BNLJ committed first, and then maybe
use some of those additional strategies as fallbacks (depending on which
issues are still unsolved by the BNLJ).

The glacier is melting more. Tomas, what's your status here? The
patch has been waiting on author for two months now. If you are not
planning to work more on this one, then it should be marked as
returned with feedback?
I'm not planning to do any immediate work on this, so I agree with
marking it as RWF. I think Melanie is working on the BNL patch, which
seems like the right solution.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 25, 2019 at 07:11:19PM +0100, Tomas Vondra wrote:
I'm not planning to do any immediate work on this, so I agree with
marking it as RWF. I think Melanie is working on the BNL patch, which
seems like the right solution.
Thanks, I have switched the patch as returned with feedback.
--
Michael
On Mon, Nov 25, 2019 at 10:11 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On Mon, Nov 25, 2019 at 05:33:35PM +0900, Michael Paquier wrote:
On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote:
My feeling is that we should get the BNLJ committed first, and then
maybe
use some of those additional strategies as fallbacks (depending on which
issues are still unsolved by the BNLJ).The glacier is melting more. Tomas, what's your status here? The
patch has been waiting on author for two months now. If you are not
planning to work more on this one, then it should be marked as
returned with feedback?

I'm not planning to do any immediate work on this, so I agree with
marking it as RWF. I think Melanie is working on the BNL patch, which
seems like the right solution.
Sorry for the delay. I have posted the parallel-aware version of the BNLJ
(adaptive HJ) in the thread which originally had all of the
patches for it [1]. It's not near committable, so I wasn't going to
register it for a commitfest yet, but I would love feedback on the
prototype.
[1]: /messages/by-id/CAAKRu_YsWm7gc_b2nBGWFPE6wuhdOLfc1LBZ786DUzaCPUDXCA@mail.gmail.com
--
Melanie Plageman