accounting for memory used for BufFile during hash joins
Hi,
I'm starting this thread mostly to keep track of patches developed in
response to an issue [1] reported on pgsql-performance. The symptoms are
very simple - a query performing a hash join ends up using much more
memory than expected (pretty much ignoring work_mem), and possibly
ending up with OOM.
The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBlock.
This is not ideal even if we happen to estimate everything correctly,
because for example with work_mem=4MB and nbatch=1024, it means we'll
use about 16MB (2*8kB*1024) for the BufFile structures alone, plus the
work_mem for the hash table itself.
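To make that arithmetic easier to play with, here is a tiny standalone
sketch (not PostgreSQL code; the ~8kB per-BufFile figure and the two
files per batch come straight from the description above):

    #include <stdio.h>

    /* rough per-BufFile footprint, dominated by the 8kB PGAlignedBlock buffer */
    #define BUFFILE_KB 8

    int
    main(void)
    {
        long    work_mem_kb = 4 * 1024;     /* work_mem = 4MB, as in the example */
        long    nbatch;

        for (nbatch = 16; nbatch <= 524288; nbatch *= 2)
        {
            /* one inner-side and one outer-side BufFile per batch */
            long    buffile_kb = 2L * BUFFILE_KB * nbatch;

            printf("nbatch = %6ld  BufFile overhead = %8ld kB  (work_mem = %ld kB)\n",
                   nbatch, buffile_kb, work_mem_kb);
        }

        /*
         * nbatch = 1024   -> 16384 kB (16MB) of BufFiles, i.e. 4x work_mem
         * nbatch = 524288 -> ~8GB of BufFiles, the OOM case described below
         */
        return 0;
    }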
But it can easily explode when we under-estimate the hash side. In the
pgsql-performance message, the hash side (with the patches applied,
allowing the query to complete) looks like this:
Hash (cost=2823846.37..2823846.37 rows=34619 width=930)
(actual time=252946.367..252946.367 rows=113478127 loops=1)
So it's 3277x under-estimated. It starts with 16 batches, and ends up
adding more and more batches until it fails with 524288 of them (it gets
to that many batches because some of the values are very common and we
don't disable the growth earlier).
The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.
The two attached patches both account for the BufFile memory, but then
use very different strategies when the work_mem limit is reached.
The first patch realizes it's impossible to keep adding batches without
breaking the work_mem limit, because at some point the BufFiles alone will
need more memory than that. But it does not make sense to stop adding batches
entirely, because then the hash table could grow indefinitely.
So the patch abandons the idea of enforcing work_mem in this situation,
and instead attempts to minimize memory usage over time - it increases
the spaceAllowed in a way that ensures doubling the number of batches
actually reduces memory usage in the long run.
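Very roughly, the reasoning can be illustrated by a standalone sketch like
this (illustration only, not the actual patch code - the names and the
exact growth rule are simplified): doubling nbatch is expected to free
about half of the current hash table, but adds one inner and one outer
BufFile per new batch, so it only pays off while the freed half exceeds the
added BufFile overhead.

    #include <stdbool.h>
    #include <stdio.h>

    #define BUFFILE_BYTES   (8 * 1024)     /* per-BufFile overhead, ~8kB */

    /*
     * Does doubling nbatch still reduce total memory use?  Doubling frees
     * roughly space_allowed / 2 of hash table, but adds 2 * 8kB * nbatch
     * of new BufFile structs (inner + outer side).
     */
    static bool
    doubling_reduces_memory(long space_allowed, long nbatch)
    {
        long    extra_buffiles = 2L * BUFFILE_BYTES * nbatch;

        return space_allowed / 2 > extra_buffiles;
    }

    int
    main(void)
    {
        long    space_allowed = 4L * 1024 * 1024;   /* 4MB */
        long    nbatch;

        for (nbatch = 16; nbatch <= 4096; nbatch *= 2)
        {
            if (!doubling_reduces_memory(space_allowed, nbatch))
                space_allowed *= 2;     /* relax the limit instead of splitting */

            printf("nbatch = %5ld  space_allowed = %6ld kB\n",
                   nbatch, space_allowed / 1024);
        }
        return 0;
    }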
The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
The overflow files can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).
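To illustrate the mapping, a tiny standalone sketch (hypothetical names,
not the patch code): each batch belongs to slice batchno / batches_per_slice,
only the batches of the currently loaded slice get their own BufFile, and
everything else goes through the owning slice's single overflow file.

    #include <stdio.h>

    #define BATCHES_PER_SLICE 128   /* e.g. work_mem = 4MB in the example above */

    static int
    slice_for_batch(int batchno)
    {
        return batchno / BATCHES_PER_SLICE;
    }

    static const char *
    target_file(int batchno, int current_slice)
    {
        return (slice_for_batch(batchno) == current_slice)
            ? "per-batch BufFile"
            : "per-slice overflow BufFile";
    }

    int
    main(void)
    {
        /* with 1024 batches there are 8 slices, but only ~128 batch files open */
        printf("batch 37  -> slice %d, %s\n", slice_for_batch(37), target_file(37, 0));
        printf("batch 900 -> slice %d, %s\n", slice_for_batch(900), target_file(900, 0));
        return 0;
    }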
Neither of those patches tweaks ExecChooseHashTableSize() to consider
memory needed for BufFiles while deciding how many batches will be
needed. That's something that probably needs to happen, but it would not
help with the underestimate issue.
I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).
The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.
It's all just PoC quality, at this point, far from committable state.
[1]: /messages/by-id/bc138e9f-c89e-9147-5395-61d51a757b3b@gusw.net
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).

I want to see if I understand the implications of the per-slice-overflow
patch for execution of hashjoin:
For each bucket in the hashtable, when attempting to double the number of
batches, if the memory that the BufFile structs will occupy once this is done
will exceed the work_mem, split each batch into slices that fit into memory.
This means that, for each probe-side tuple hashing to that bucket, you have to
load every slice of each batch separately into memory to ensure correct
results.
Is this right?
I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).

The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.

So, my initial reaction after taking a look at the patches is that I prefer
the first approach--increasing the resize threshold. The second patch, the
per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
address what is, based on my understanding, an edge case.
--
Melanie Plageman
On Tue, May 7, 2019 at 9:58 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).

I want to see if I understand the implications of the per-slice-overflow patch
for execution of hashjoin:
For each bucket in the hashtable, when attempting to double the number of
batches, if the memory that the BufFile structs will occupy once this is done
will exceed the work_mem, split each batch into slices that fit into memory.
This means that, for each probe-side tuple hashing to that bucket, you have to
load every slice of each batch separately into memory to ensure correct results.
Is this right?
Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice. But I wonder how we'd deal with
outer joins, as Tom Lane asked in another thread:
/messages/by-id/12185.1488932980@sss.pgh.pa.us
I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).

The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.

So, my initial reaction after taking a look at the patches is that I prefer the
first approach--increasing the resize threshold. The second patch, the
per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
address what is, based on my understanding, an edge case.
Personally I'd like to make work_mem more reliable, even if it takes a
major new mechanism.
Stepping back a bit, I think there is something fishy about the way we
detect extreme skew. Is that a factor in this case? Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory. Oops.
I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys fall into the batch containing
the key for the M duplicates before we give up! You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M. In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch. I'm not sure how you
pick the number.
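For illustration only, the relaxed check could look something like this
standalone sketch (the 95% constant and the function are made up; in master
the corresponding test in ExecHashIncreaseNumBatches fires only on an exact
0%/100% split, i.e. nfreed == 0 || nfreed == ninmemory):

    #include <stdbool.h>
    #include <stdio.h>

    #define SKEW_GIVEUP_FRACTION 0.95

    /*
     * nfreed    = tuples that moved to the new child batch during the split
     * ninmemory = tuples the batch held before the split
     */
    static bool
    split_was_useless(long nfreed, long ninmemory)
    {
        long    kept = ninmemory - nfreed;

        /* nearly everything stayed put, or nearly everything moved */
        return kept >= SKEW_GIVEUP_FRACTION * ninmemory ||
               nfreed >= SKEW_GIVEUP_FRACTION * ninmemory;
    }

    int
    main(void)
    {
        printf("%d\n", split_was_useless(0, 100000));      /* 0% moved: give up */
        printf("%d\n", split_was_useless(3000, 100000));   /* 3% moved: give up */
        printf("%d\n", split_was_useless(40000, 100000));  /* 40% moved: keep going */
        return 0;
    }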
Of course that doesn't solve the problem that we don't have a better
plan for dealing with the M duplicates -- it just avoids needless
batch explosions triggered by bad maths. I think we need something
like Tomas's #2, or a way to switch to sort-merge, or some other
scheme. I'm not sure how to compare the slice idea, which involves
processing outer tuples * inner slices with the sort-merge idea, which
involves sorting the inner and outer batch, plus the entirely new
concept of switching to another node at execution time.
I also wondered about reducing the buffer size of the BufFiles, but
that doesn't seem to be fixing the real problem.
--
Thomas Munro
https://enterprisedb.com
On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
On Tue, May 7, 2019 at 9:58 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).

I want to see if I understand the implications of the per-slice-overflow patch
for execution of hashjoin:
For each bucket in the hashtable, when attempting to double the number of
batches, if the memory that the BufFile structs will occupy once this is done
will exceed the work_mem, split each batch into slices that fit into memory.
This means that, for each probe-side tuple hashing to that bucket, you have to
load every slice of each batch separately into memory to ensure correct results.
Is this right?
Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice.
Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.
It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].
[1]: /messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development
But I wonder how we'd deal with outer joins, as Tom Lane asked in
another thread:
That seems unrelated - we slice the array of batches, to keep memory
needed for BufFile under control. The hash table remains intact, so
there's no issue with outer joins.
I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).

The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.

So, my initial reaction after taking a look at the patches is that I prefer the
first approach--increasing the resize threshold. The second patch, the
per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
address what is, based on my understanding, an edge case.

Personally I'd like to make work_mem more reliable, even if it takes a
major new mechanism.
Yeah, I share that attitude.
Stepping back a bit, I think there is something fishy about the way we
detect extreme skew. Is that a factor in this case? Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory. Oops.
Yes, that was a factor in the reported query - the data set contained a
significant number of duplicate values (~10%), but it took a while to
disable growth because there always happened to be a couple of rows with a
different value.
I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys fall into the batch containing
the key for the M duplicates before we give up! You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M. In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch. I'm not sure how you
pick the number.
I agree we should relax the 0%/100% split condition, and disable the
growth sooner. But I think we should also re-evaluate that decision
after a while - the data set may be correlated in some way, in which
case we may disable the growth prematurely. It may not reduce memory
usage now, but it may help in the future.
It's already an issue, but it would be even more likely if we disabled
growth e.g. with just 5%/95% splits.
FWIW I believe this is a mostly orthogonal issue to what's discussed in
this thread.
Of course that doesn't solve the problem that we don't have a better
plan for dealing with the M duplicates -- it just avoids a needless
batch explosions triggered by bad maths. I think we need something
like Tomas's #2, or a way to switch to sort-merge, or some other
scheme. I'm not sure how to compare the slice idea, which involves
processing outer tuples * inner slices with the sort-merge idea, which
involves sorting the inner and outer batch, plus the entirely new
concept of switching to another node at execution time.
Do we actually check how many duplicates are there during planning? I
wonder if we could penalize (or even disable) hashjoins when there are
too many duplicates to fit into work_mem. Of course, that's going to be
tricky with filtering, and so on.
Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.
I also wondered about reducing the buffer size of the BufFiles, but
that doesn't seem to be fixing the real problem.
Yeah. It might help a bit, but it's very limited - even if you reduce
the buffer to say 1kB, it's just a factor of 8. And I'm not sure what
would be the impact on performance.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Do we actually check how many duplicates are there during planning?
Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.
regards, tom lane
On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice.

Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.
Sorry, I read only the description and not the code, and got confused
about that. So, I see three separate but related problems:
A. Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys. We could fix that
with (say) a 95% rule.
B. Lack of good alternative execution strategy when the escape valve
is triggered. A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C. Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.
But I wonder how we'd deal with outer joins, as Tom Lane asked in
another thread:

That seems unrelated - we slice the array of batches, to keep memory
needed for BufFile under control. The hash table remains intact, so
there's no issue with outer joins.
Right, sorry, my confusion. I thought you were describing
https://en.wikipedia.org/wiki/Block_nested_loop. (I actually think we
can make that work for left outer joins without too much fuss by
writing out a stream of match bits to a new temporary file. Googling,
I see that MySQL originally didn't support BNL for outer joins and
then added some match flag propagation thing recently.)
I agree we should relax the 0%/100% split condition, and disable the
growth sooner. But I think we should also re-evaluate that decision
after a while - the data set may be correlated in some way, in which
case we may disable the growth prematurely. It may not reduce memory
usage now, but it may help in the future.

It's already an issue, but it would be even more likely if we disabled
growth e.g. with just 5%/95% splits.

FWIW I believe this is a mostly orthogonal issue to what's discussed in
this thread.
But isn't problem A the root cause of problem C, in most cases? There
must also be "genuine" cases of problem C that would occur even if we
fix that, of course: someone has small work_mem, and data that can be
effectively partitioned to fit it, but it just takes a huge number of
partitions to do it. So that we don't behave badly in those cases, I
agree with you 100%: we should fix the memory accounting to count
BufFile overheads as you are proposing, and then I guess ideally
switch to our alternative strategy (BNL or sort-merge or ...) when we
see that BufFiles are wasting too much work_mem and it's time to try
something else. It seems you don't actually have one of those cases
here, though?
I think we should fix problem A. Then handle problem C by accounting
for BufFiles, and figure out a way to switch to our alternative
strategy (currently: ignore work_mem), when we think that creating
more BufFiles will be futile (not sure exactly what the rule there
should be). And then work on fixing B properly with a good strategy.
Here's a straw-man idea: we could adopt BNL, and then entirely remove
our repartitioning code. If the planner's number of partitions turns
out to be not enough, we'll just handle it using BNL loops.
Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.
The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable. So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.
--
Thomas Munro
https://enterprisedb.com
On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Do we actually check how many duplicates are there during planning?
Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.
I'm looking at the code, and the only place where I see code dealing with
MCVs (probably the best place for info about duplicate values) is
estimate_hash_bucketsize in final_cost_hashjoin. That's not quite what I
had in mind - I was thinking more about something along the lines of "See the
largest group of duplicate values, disable hash join if it can't fit into
work_mem at all."
Of course, if the input estimates are off, that may not work too well. It
would certainly not help the query failing with OOM, because that was a
case of severe underestimate.
Or did you mean some other piece of code that I have missed?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice.

Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

Sorry, I read only the description and not the code, and got confused
about that. So, I see three separate but related problems:

A. Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys. We could fix that
with (say) a 95% rule.
B. Lack of good alternative execution strategy when the escape valve
is triggered. A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C. Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.
Right. I don't think a single solution addressing all those issues exists.
It's more likely we need multiple improvements.
But I wonder how we'd deal with outer joins, as Tom Lane asked in
another thread:

That seems unrelated - we slice the array of batches, to keep memory
needed for BufFile under control. The hash table remains intact, so
there's no issue with outer joins.

Right, sorry, my confusion. I thought you were describing
https://en.wikipedia.org/wiki/Block_nested_loop. (I actually think we
can make that work for left outer joins without too much fuss by
writing out a stream of match bits to a new temporary file. Googling,
I see that MySQL originally didn't support BNL for outer joins and
then added some match flag propagation thing recently.)
Possibly, I'm not against implementing that, although I don't have a very
good idea what the benefits of BNL joins are (performance-wise). In any
case, I think it's entirely unrelated to hash joins.
I agree we should relax the 0%/100% split condition, and disable the
growth sooner. But I think we should also re-evaluate that decision
after a while - the data set may be correlated in some way, in which
case we may disable the growth prematurely. It may not reduce memory
usage now, but it may help in the future.

It's already an issue, but it would be even more likely if we disabled
growth e.g. with just 5%/95% splits.

FWIW I believe this is a mostly orthogonal issue to what's discussed in
this thread.

But isn't problem A the root cause of problem C, in most cases? There
must also be "genuine" cases of problem C that would occur even if we
fix that, of course: someone has small work_mem, and data that can be
effectively partitioned to fit it, but it just takes a huge number of
partitions to do it. So that we don't behave badly in those cases, I
agree with you 100%: we should fix the memory accounting to count
BufFile overheads as you are proposing, and then I guess ideally
switch to our alternative strategy (BNL or sort-merge or ...) when we
see that BufFiles are wasting too much work_mem and it's time to try
something else. It seems you don't actually have one of those cases
here, though?
Maybe. Or maybe not. I don't have enough data to make such judgements
about the causes in general. We have one query from pgsql-performance.
There might be more, but IMO that's probably a biased data set.
But even that reported query is actually not a case of A causing C.
The outer side of the hash join was significantly underestimated (34619
vs. 113478127) due to highly-correlated conditions.
And in that case it's trivial to cause nbatch explosion even with perfect
data sets with no duplicates (so no escape valve failure).
I think we should fix problem A. Then handle problem C by accounting
for BufFiles, and figure out a way to switch to our alternative
strategy (currently: ignore work_mem), when we think that creating
more BufFiles will be futile (not sure exactly what the rule there
should be). And then work on fixing B properly with a good strategy.
Here's a straw-man idea: we could adopt BNL, and then entirely remove
our repartitioning code. If the planner's number of partitions turns
out to be not enough, we'll just handle it using BNL loops.
Yeah, something like that.
I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" a nbatch increase. That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again. Rinse and repeat.
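As a toy sketch of that loop (standalone illustration only - the struct,
the names and the 10% "does it help" threshold are all made up):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct GrowState
    {
        bool    grow_enabled;       /* is nbatch growth currently allowed? */
        long    space_threshold;    /* memory level at which we reconsider */
    } GrowState;

    /*
     * Called when the hash table reaches space_threshold.  The caller is
     * assumed to have walked the current batch and computed what fraction
     * of its tuples would move if nbatch were doubled.
     */
    static void
    reconsider_growth(GrowState *gs, double fraction_that_would_move)
    {
        if (fraction_that_would_move > 0.10)
            gs->grow_enabled = true;        /* splitting helps again: re-enable */
        else
            gs->space_threshold *= 2;       /* still futile: back off, retry later */
    }

    int
    main(void)
    {
        GrowState   gs = {false, 8L * 1024 * 1024};

        reconsider_growth(&gs, 0.02);       /* almost nothing would move */
        printf("enabled=%d threshold=%ld kB\n", gs.grow_enabled, gs.space_threshold / 1024);

        reconsider_growth(&gs, 0.40);       /* data no longer that skewed */
        printf("enabled=%d threshold=%ld kB\n", gs.grow_enabled, gs.space_threshold / 1024);
        return 0;
    }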
For C, I think we can use either of the two approaches I proposed. I like
the second option better, as it actually enforces work_mem. The first
option kinda helped with A too, although in a different way, and I think the
solution I outlined in the previous paragraph will work better.
No opinion regarding the switch to BNL, at the moment.
Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable. So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.
Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have already been sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Do we actually check how many duplicates are there during planning?
Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.
I'm looking at the code, and the only place where I see code dealing with
MCVs (probably the best place for info about duplicate values) is
estimate_hash_bucketsize in final_cost_hashjoin.
What I'm thinking of is this bit in final_cost_hashjoin:
   /*
    * If the bucket holding the inner MCV would exceed work_mem, we don't
    * want to hash unless there is really no other alternative, so apply
    * disable_cost. (The executor normally copes with excessive memory usage
    * by splitting batches, but obviously it cannot separate equal values
    * that way, so it will be unable to drive the batch size below work_mem
    * when this is true.)
    */
   if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
                          inner_path->pathtarget->width) >
       (work_mem * 1024L))
       startup_cost += disable_cost;
It's certainly likely that that logic needs improvement in view of this
discussion --- I was just pushing back on the claim that we weren't
considering the issue at all.
regards, tom lane
On Tue, May 07, 2019 at 10:42:36AM -0400, Tom Lane wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Do we actually check how many duplicates are there during planning?
Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.

I'm looking at the code, and the only place where I see code dealing with
MCVs (probably the best place for info about duplicate values) is
estimate_hash_bucketsize in final_cost_hashjoin.

What I'm thinking of is this bit in final_cost_hashjoin:
   /*
    * If the bucket holding the inner MCV would exceed work_mem, we don't
    * want to hash unless there is really no other alternative, so apply
    * disable_cost. (The executor normally copes with excessive memory usage
    * by splitting batches, but obviously it cannot separate equal values
    * that way, so it will be unable to drive the batch size below work_mem
    * when this is true.)
    */
   if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
                          inner_path->pathtarget->width) >
       (work_mem * 1024L))
       startup_cost += disable_cost;

It's certainly likely that that logic needs improvement in view of this
discussion --- I was just pushing back on the claim that we weren't
considering the issue at all.
Ah, this code is new in 11, and I was looking at code from 10 for some
reason. I don't think we can do much better than this, except perhaps
falling back to (1/ndistinct) when there's no MCV available.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].

[1]: /messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development

Cool, I misunderstood. I looked at the code again today, and at the email
thread where you measured "amplification".
In terms of how many times you write each tuple, is it accurate to say that
a tuple can now be spilled three times (in the worst case) whereas, before,
it could be spilled only twice?

1 - when building the inner side hashtable, tuple is spilled to a "slice" file
2 - (assuming the number of batches was increased) during execution, when a
tuple belonging to a later slice's spill file is found, it is re-spilled to
that slice's spill file
3 - during execution, when reading from its slice file, it is re-spilled
(again) to its batch's spill file

Is it correct that the max number of BufFile structs you will have is equal
to the number of slices + the number of batches in a slice, because that is
the max number of open BufFiles you would have at a time?
By the way, applying v4 patch on master, in an assert build, I am tripping
some asserts -- starting with

    Assert(!file->readOnly);

in BufFileWrite.
One thing I was a little confused by was the nbatch_inmemory member of the
hashtable. The comment in ExecChooseHashTableSize says that it is determining
the number of batches we can fit in memory. I thought that the problem was
the amount of space taken up by the BufFile data structure itself--which is
related to the number of open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about fitting
more than one batch of tuples into memory at a time. I was under the
impression that you could only fit one batch of tuples in memory at a time.
So, I was stepping through the code with work_mem set to the lower bound,
and in ExecHashIncreaseNumBatches, I got confused.
hashtable->nbatch_inmemory was 2 for me, thus nbatch_tmp was 2, so I didn't
meet this condition

    if (nbatch_tmp > hashtable->nbatch_inmemory)

since I just set nbatch_tmp using hashtable->nbatch_inmemory. So, I didn't
increase the number of slices, which is what I was expecting.
What happens when hashtable->nbatch_inmemory is equal to nbatch_tmp?
--
Melanie Plageman
On Tue, May 7, 2019 at 6:59 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable. So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.

Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have already been sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.
What if you switched to NLJ on a batch-by-batch basis and did it before
starting execution of the join but after building the inner side of the hash
table? That way, no tuples will have been sent to other nodes yet.
--
Melanie Plageman
On Tue, May 07, 2019 at 05:43:56PM -0700, Melanie Plageman wrote:
On Tue, May 7, 2019 at 6:59 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:

On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
Switching to some other algorithm during execution moves the goalposts
to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable. So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.

Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have already been sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.

What if you switched to NLJ on a batch-by-batch basis and did it before
starting execution of the join but after building the inner side of the hash
table? That way, no tuples will have been sent to other nodes yet.
Interesting idea! I think you're right doing it on a per-batch basis
would solve that problem. Essentially, if all (or >95%) of the tuples
have the same hash value, we could switch to a special "degraded" mode
doing something like a NL. At that point the hash table benefits are
lost anyway, because all the tuples are in a single chain, so it's not
going to be much slower.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].

[1]: /messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development

Cool, I misunderstood. I looked at the code again today, and at the email
thread where you measured "amplification".
Oh! I hope you're not too disgusted by the code in that PoC patch ;-)
In terms of how many times you write each tuple, is it accurate to
say that a tuple can now be spilled three times (in the worst case)
whereas, before, it could be spilled only twice?

1 - when building the inner side hashtable, tuple is spilled to a "slice"
file
2 - (assuming the number of batches was increased) during execution, when
a tuple belonging to a later slice's spill file is found, it is re-spilled
to that slice's spill file
3 - during execution, when reading from its slice file, it is re-spilled
(again) to its batch's spill file
Yes, that's a mostly accurate understanding. Essentially this might add
one extra step of "reshuffling" from the per-slice to per-batch files.
Is it correct that the max number of BufFile structs you will have
is equal to the number of slices + number of batches in a slice
because that is the max number of open BufFiles you would have at a
time?
Yes. With the caveat that we need twice that number of BufFile structs,
because we need them on both sides of the join.
By the way, applying v4 patch on master, in an assert build, I am tripping
some
asserts -- starting with
Assert(!file->readOnly);
in BufFileWrite
Whoooops :-/
One thing I was a little confused by was the nbatch_inmemory member
of the hashtable. The comment in ExecChooseHashTableSize says that
it is determining the number of batches we can fit in memory. I
thought that the problem was the amount of space taken up by the
BufFile data structure itself--which is related to the number of
open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about
fitting more than one batch of tuples into memory at a time. I was
under the impression that you could only fit one batch of tuples in
memory at a time.
I suppose you mean this chunk:
    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;
Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.
Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.
So, I was stepping through the code with work_mem set to the lower
bound, and in ExecHashIncreaseNumBatches, I got confused.
hashtable->nbatch_inmemory was 2 for me, thus nbatch_tmp was 2, so
I didn't meet this condition

    if (nbatch_tmp > hashtable->nbatch_inmemory)

since I just set nbatch_tmp using hashtable->nbatch_inmemory. So, I
didn't increase the number of slices, which is what I was expecting.
What happens when hashtable->nbatch_inmemory is equal to nbatch_tmp?
Ah, good catch. The condition you're referring to

    if (nbatch_tmp > hashtable->nbatch_inmemory)

should actually be

    if (nbatch > hashtable->nbatch_inmemory)
because the point is to initialize BufFile structs for the overflow
files, and we need to do that once we cross nbatch_inmemory.
And it turns out this actually causes the assert failures in regression
tests, you reported earlier. It failed to initialize the overflow files
in some cases, so the readOnly flag seemed to be set.
Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
On Wed, May 8, 2019 at 8:08 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
One thing I was a little confused by was the nbatch_inmemory member
of the hashtable. The comment in ExecChooseHashTableSize says that
it is determining the number of batches we can fit in memory. I
thought that the problem was the amount of space taken up by the
BufFile data structure itself--which is related to the number of
open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about
fitting more than one batch of tuples into memory at a time. I was
under the impression that you could only fit one batch of tuples in
memory at a time.

I suppose you mean this chunk:
    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;

Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.

Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.
I definitely would prefer to see hashtable->nbatch_inmemory renamed to
hashtable->nbatch_slice--or maybe hashtable->nbuff_inmemory?
I've been poking around the code for awhile today, and, even though I
know that the nbatch_inmemory is referring to the buffiles that can
fit in memory, I keep forgetting and thinking it is referring to the
tuple data that can fit in memory.
It might be worth explicitly calling out somewhere in the comments that
overflow slices will only be created either when the number of batches was
underestimated and ExecHashIncreaseNumBatches pushes the new number of
batches past hashtable->nbatch_inmemory, or when the number of batches
already exceeds hashtable->nbatch_inmemory at hashtable creation time (the
name confuses this for me especially at creation time) -- that is, past the
number of actual buffiles that can be managed in memory.
Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.
So, I ran the following example on master and with your patch.
drop table foo;
drop table bar;
create table foo(a int, b int);
create table bar(c int, d int);
insert into foo select i, i from generate_series(1,10000)i;
insert into bar select 1, 1 from generate_series(1,1000)i;
insert into bar select i%3, i%3 from generate_series(1000,10000)i;
insert into foo select 1,1 from generate_series(1,1000)i;
analyze foo; analyze bar;
set work_mem=64;
On master, explain analyze looked like this
postgres=# explain analyze verbose select * from foo, bar where a = c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=339.50..53256.27 rows=4011001 width=16) (actual
time=28.962..1048.442 rows=4008001 loops=1)
Output: foo.a, foo.b, bar.c, bar.d
Hash Cond: (bar.c = foo.a)
-> Seq Scan on public.bar (cost=0.00..145.01 rows=10001 width=8)
(actual time=0.030..1.777 rows=10001 loops=1)
Output: bar.c, bar.d
-> Hash (cost=159.00..159.00 rows=11000 width=8) (actual
time=12.285..12.285 rows=11000 loops=1)
Output: foo.a, foo.b
Buckets: 2048 (originally 2048) Batches: 64 (originally 16)
Memory Usage: 49kB
-> Seq Scan on public.foo (cost=0.00..159.00 rows=11000 width=8)
(actual time=0.023..3.786 rows=11000 loops=1)
Output: foo.a, foo.b
Planning Time: 0.435 ms
Execution Time: 1206.904 ms
(12 rows)
and with your patch, it looked like this.
postgres=# explain analyze verbose select * from foo, bar where a = c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=339.50..53256.27 rows=4011001 width=16) (actual
time=28.256..1102.026 rows=4008001 loops=1)
Output: foo.a, foo.b, bar.c, bar.d
Hash Cond: (bar.c = foo.a)
-> Seq Scan on public.bar (cost=0.00..145.01 rows=10001 width=8)
(actual time=0.040..1.717 rows=10001 loops=1)
Output: bar.c, bar.d
-> Hash (cost=159.00..159.00 rows=11000 width=8) (actual
time=12.327..12.327 rows=11000 loops=1)
Output: foo.a, foo.b
Buckets: 2048 (originally 2048) Batches: 16384 (originally 16,
in-memory 2) Memory Usage: 131160kB
-> Seq Scan on public.foo (cost=0.00..159.00 rows=11000 width=8)
(actual time=0.029..3.569 rows=11000 loops=1)
Output: foo.a, foo.b
Planning Time: 0.260 ms
Execution Time: 1264.995 ms
(12 rows)
I noticed that the number of batches is much higher with the patch,
and, checking $PGDATA/base/pgsql_tmp, I saw that the number of temp
files (which are the overflow files) at any given time was quite high.
I would imagine that the desired behaviour is to keep memory usage
within work_mem.
In this example, the number of slices is about 8000, each of which
would have an overflow file. Is this the case you mention in the
comment in ExecChooseHashTableSize?
* We ignore (per-slice)
* overflow files, because those serve as "damage control" for cases
* when per-batch BufFiles would exceed work_mem. Given enough batches
* it's impossible to enforce work_mem strictly, because the overflow
* files alone will consume more memory.
--
Melanie Plageman
On Tue, May 21, 2019 at 05:38:50PM -0700, Melanie Plageman wrote:
On Wed, May 8, 2019 at 8:08 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
One thing I was a little confused by was the nbatch_inmemory member
of the hashtable. The comment in ExecChooseHashTableSize says that
it is determining the number of batches we can fit in memory. I
thought that the problem was the amount of space taken up by the
BufFile data structure itself--which is related to the number of
open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about
fitting more than one batch of tuples into memory at a time. I was
under the impression that you could only fit one batch of tuples in
memory at a time.

I suppose you mean this chunk:

    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;

Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.

Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.

I definitely would prefer to see hashtable->nbatch_inmemory renamed to
hashtable->nbatch_slice--or maybe hashtable->nbuff_inmemory?

I've been poking around the code for awhile today, and, even though I
know that the nbatch_inmemory is referring to the buffiles that can
fit in memory, I keep forgetting and thinking it is referring to the
tuple data that can fit in memory.
That's a fair point. I think nbatch_slice is a good name.
It might be worth explicitly calling out somewhere in the comments
that overflow slices will only be created either when the number of
batches was underestimated as part of ExecHashIncreaseNumBatches and
the new number of batches exceeds the value for
hashtable->nbatch_inmemory or when creating the hashtable initially
and the number of batches exceeds the value for
hashtable->nbatch_inmemory (the name confuses this for me at hashtable
creation time especially) -- the number of actual buffiles that can be
managed in memory.
Yes, this definitely needs to be explained somewhere - possibly in a
comment at the beginning of nodeHash.c or something like that.
FWIW I wonder if this "slicing" would be useful even with correct
estimates. E.g. let's say we can fit 128 batches into work_mem, but we
expect to need 256 (and it's accurate). At that point it's probably too
aggressive to disable hash joins - a merge join is likely more expensive
than just using the slicing. But that should be a cost-based decision.
Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.

So, I ran the following example on master and with your patch.
drop table foo;
drop table bar;
create table foo(a int, b int);
create table bar(c int, d int);
insert into foo select i, i from generate_series(1,10000)i;
insert into bar select 1, 1 from generate_series(1,1000)i;
insert into bar select i%3, i%3 from generate_series(1000,10000)i;
insert into foo select 1,1 from generate_series(1,1000)i;
analyze foo; analyze bar;
set work_mem=64;

On master, explain analyze looked like this
postgres=# explain analyze verbose select * from foo, bar where a = c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=339.50..53256.27 rows=4011001 width=16) (actual
time=28.962..1048.442 rows=4008001 loops=1)
Output: foo.a, foo.b, bar.c, bar.d
Hash Cond: (bar.c = foo.a)
-> Seq Scan on public.bar (cost=0.00..145.01 rows=10001 width=8)
(actual time=0.030..1.777 rows=10001 loops=1)
Output: bar.c, bar.d
-> Hash (cost=159.00..159.00 rows=11000 width=8) (actual
time=12.285..12.285 rows=11000 loops=1)
Output: foo.a, foo.b
Buckets: 2048 (originally 2048) Batches: 64 (originally 16)
Memory Usage: 49kB
-> Seq Scan on public.foo (cost=0.00..159.00 rows=11000 width=8)
(actual time=0.023..3.786 rows=11000 loops=1)
Output: foo.a, foo.b
Planning Time: 0.435 ms
Execution Time: 1206.904 ms
(12 rows)

and with your patch, it looked like this.
postgres=# explain analyze verbose select * from foo, bar where a = c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=339.50..53256.27 rows=4011001 width=16) (actual
time=28.256..1102.026 rows=4008001 loops=1)
Output: foo.a, foo.b, bar.c, bar.d
Hash Cond: (bar.c = foo.a)
-> Seq Scan on public.bar (cost=0.00..145.01 rows=10001 width=8)
(actual time=0.040..1.717 rows=10001 loops=1)
Output: bar.c, bar.d
-> Hash (cost=159.00..159.00 rows=11000 width=8) (actual
time=12.327..12.327 rows=11000 loops=1)
Output: foo.a, foo.b
Buckets: 2048 (originally 2048) Batches: 16384 (originally 16,
in-memory 2) Memory Usage: 131160kB
-> Seq Scan on public.foo (cost=0.00..159.00 rows=11000 width=8)
(actual time=0.029..3.569 rows=11000 loops=1)
Output: foo.a, foo.b
Planning Time: 0.260 ms
Execution Time: 1264.995 ms
(12 rows)

I noticed that the number of batches is much higher with the patch,
and, checking $PGDATA/base/pgsql_tmp, I saw that the number of temp
files (which are the overflow files) at any given time was quite high.

I would imagine that the desired behaviour is to keep memory usage
within work_mem.
There's definitely something fishy going on. I suspect it's either because
of the duplicate values (which might fit into 64kB on master, but not when
accounting for BufFile). Or maybe it's because the initial 16 batches
can't possibly fit into work_mem.
If you try with a larger work_mem, say 256kB, does that behave OK?
In this example, the number of slices is about 8000, each of which
would have an overflow file. Is this the case you mention in the
comment in ExecChooseHashTableSize?

* We ignore (per-slice)
* overflow files, because those serve as "damage control" for cases
* when per-batch BufFiles would exceed work_mem. Given enough batches
* it's impossible to enforce work_mem strictly, because the overflow
* files alone will consume more memory.
Yes. 8000 slices is ~64MB, so considering we need them on both sides of
the join that'd be ~128MB. Which is pretty much exactly 131160kB.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBlock.

The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x less overflow files (compared to per-batch).
Hi Tomas
I read your second patch which uses overflow buf files to reduce the total
number of batches.
It would solve the hash join OOM problem that you discussed above: 8kB per
batch leads to the batch bloating problem.
I mentioned in another thread:
/messages/by-id/CAB0yrekv=6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA@mail.gmail.com
There is another hashjoin OOM problem which disables splitting batches too
early. PG uses a flag hashtable->growEnabled to determine whether to split
batches. Once one split fails (all the tuples are assigned to only one of
the two child batches), the growEnabled flag is turned off forever.
This is the opposite side of the batch bloating problem: it ends up with too
few batches and makes the in-memory hash table too large to fit into memory.
Here is the tradeoff: each batch takes more than 8kB (8kB makes sense, due to
performance), the in-memory hash table takes memory as well, and splitting
batches may (but does not have to) reduce the in-memory hash table size while
introducing more batches (and thus more memory usage, 8kB * #batches).
Can we conclude that it is worth splitting only if:
(the reduced memory of the in-memory hash table) - (8kB * number of new
batches) > 0
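In code form the rule is just something like this (a rough standalone
sketch; reduced_hashtable_bytes would have to be estimated, e.g. by walking
the batch, and the names are only for illustration):

    #include <stdbool.h>
    #include <stdio.h>

    #define BUFFILE_BYTES   (8 * 1024)      /* per-batch BufFile overhead, ~8kB */

    /*
     * Splitting is only worth it if the expected reduction of the in-memory
     * hash table outweighs the BufFile overhead of the new batches.
     */
    static bool
    split_is_worth_it(long reduced_hashtable_bytes, long n_new_batches)
    {
        return reduced_hashtable_bytes - BUFFILE_BYTES * n_new_batches > 0;
    }

    int
    main(void)
    {
        /* doubling 64 -> 128 batches adds 64 new batches (512kB of BufFiles) */
        printf("%d\n", split_is_worth_it(2L * 1024 * 1024, 64));    /* frees 2MB: yes */
        printf("%d\n", split_is_worth_it(256L * 1024, 64));         /* frees 256kB: no */
        return 0;
    }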
So I'm considering combining our patch with your patch to fix the join OOM
problem, no matter whether the OOM is caused by (the memory usage of the
in-memory hash table) or by (8kB * number of batches).
nbatch_inmemory in your patch could also be redefined using the rule above.
What's your opinion?
Thanks
Hubert Zhang
Hi Tomas,
Here is the patch; it should be compatible with your patch, and it focuses
on when to grow the number of batches again.
On Tue, May 28, 2019 at 3:40 PM Hubert Zhang <hzhang@pivotal.io> wrote:
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBlock.

The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).

Hi Tomas
I read your second patch which uses overflow buf files to reduce the total
number of batches.
It would solve the hash join OOM problem what you discussed above: 8K per
batch leads to batch bloating problem.

I mentioned in another thread:
/messages/by-id/CAB0yrekv=6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA@mail.gmail.com
There is another hashjoin OOM problem which disables splitting batches too
early. PG uses a flag hashtable->growEnable to determine whether to split
batches. Once one splitting failed(all the tuples are assigned to only one
batch of two split ones) The growEnable flag would be turned off forever.The is an opposite side of batch bloating problem. It only contains too
few batches and makes the in-memory hash table too large to fit into memory.Here is the tradeoff: one batch takes more than 8KB(8KB makes sense, due
to performance), in-memory hash table takes memory as well and splitting
batched may(not must) reduce the in-memory hash table size but introduce
more batches(and thus more memory usage 8KB*#batch).
Can we conclude that it would be worth to splitting if satisfy:
(The reduced memory of in-memory hash table) - (8KB * number of new
batches) > 0So I'm considering to combine our patch with your patch to fix join OOM
problem. No matter the OOM is introduced by (the memory usage of
in-memory hash table) or (8KB * number of batches).nbatch_inmemory in your patch could also use the upper rule to redefine.
What's your opinion?
Thanks
Hubert Zhang
--
Thanks
Hubert Zhang
On Tue, May 28, 2019 at 03:40:01PM +0800, Hubert Zhang wrote:
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Hi Tomas
I read your second patch which uses overflow buf files to reduce the total
number of batches.
It would solve the hash join OOM problem you discussed above: the ~8kB per
batch leads to the batch-bloating problem.

I mentioned in another thread:
/messages/by-id/CAB0yrekv=6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA@mail.gmail.com
There is another hashjoin OOM problem, caused by disabling batch splitting
too early. PG uses the flag hashtable->growEnabled to decide whether to
split batches. Once one split fails (all the tuples end up in only one of
the two new batches), the growEnabled flag is turned off forever.

This is the opposite side of the batch-bloating problem: it leaves too few
batches and makes the in-memory hash table too large to fit into memory.
Yes. There are definitely multiple separate issues in the hashjoin code,
and the various improvements discussed in this (and other) thread usually
address just a subset of them. We need to figure out how to combine them
or maybe devise some more generic solution.
So I think we need to take a step back, and figure out how to combine
these improvements - otherwise we might commit a fix for one issue, making
it much harder/impossible to improve the other issues.
The other important question is whether we see these cases as outliers
(and the solutions as last-resort-attempt-to-survive kind of fix) or more
widely applicable optimizations. I've seen some interesting speedups with
the overflow-batches patch, but my feeling is we should really treat it as
a last-resort to survive.
I had a chat about this with Thomas Munro yesterday. Unfortunately, some
beer was involved but I do vaguely remember he more or less convinced me
the BNL (block nested loop join) might be the right approach here. We
don't have any patch for that yet, though :-(
Here is the tradeoff: each batch takes more than 8kB (the 8kB buffer makes
sense for performance), the in-memory hash table takes memory as well, and
splitting batches may (but does not always) reduce the in-memory hash table
size while introducing more batches (and thus more memory usage, 8kB *
#batches).

Can we conclude that splitting is worthwhile only if:
(the reduction in in-memory hash table size) - (8kB * number of new
batches) > 0
Something like that, yes.
So I'm considering combining our patch with your patch to fix the hash join
OOM problem, whether the OOM is caused by the memory usage of the in-memory
hash table or by the 8kB * #batches overhead.

nbatch_inmemory in your patch could also be redefined using the rule above.
What's your opinion?
One of the issues with my "overflow batches" patch, pointed out to me by
Thomas yesterday, is that it only works with non-parallel hash join. And
we don't know how to make it work in the parallel mode :-(
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Okay, so, while I do have specific, actual code review/commitfest-y
feedback for the patch in this thread registered for this commitfest,
I wanted to defer that for a later email and use this one to cover off
on a few higher level issues.
1) How this patch's approach fits into the wider set of problems with
hybrid hashjoin.
2) Parallel HashJoin implementation of this patch's approach
I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.
I do think that accounting for Buffile overhead when estimating the
size of the hashtable during ExecChooseHashTableSize() so it can be
used during planning is a worthwhile patch by itself (though I know it
is not even part of this patch).
I'll start with 2 since I have less to say there.
From comments upthread, I take it this would not work with parallel
hashjoin as expected. Is this because each worker operates on batches
independently and now batches are lumped into slices?
Thinking through a parallel-aware implementation, it seems like you
would use slice-based barriers for the build phase but batch-based
barriers for the probe phase to avoid getting out of sync (workers
with outer tuples from one batch should not try and join those with
tuples from another batch, even if in the same slice).
You would, of course, need to add code to make slices work with
SharedTuplestore--caveat here is I still haven't tried to understand
how parallel-aware hashjoin works/uses SharedTuplestore.
Now, addressing 1, how this patch fits into the wider set of problems
with current hybrid hashjoin:
Thomas Munro nicely summarized roughly what I'm about to lay out like
this (upthread) -- he called them "three separate but related
problems":
A. Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys. We could fix that
with (say) a 95% rule.
B. Lack of good alternative execution strategy when the escape valve
is triggered. A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C. Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.
However, I would like to lay out the problem space a little bit
differently. (using the end of the alphabet to differentiate).
The following scenarios are how you could end up running out of
memory:
Y. Plan-time underestimation of the number of required batches with
relatively uniform data distribution
In this case, the best join execution strategy is a plain hashjoin
with spilling as needed.
nbatches should be increased as needed, because the data is ~evenly
distributed.
slicing should be employed when the BufFile overhead exceeds some
threshold ratio of work_mem allowed for BufFile overhead
Z. Plan- and/or execution-time underestimation of the number of
required batches with skewed data
If you knew this at planning time, you could have picked another
join-type, though, there might be cases where it would actually be
less costly to use plain hashjoin for all batches except the bad batch
and fall back to hash block nested loop join just for the duplicate
values.
If you could not have known this at planning time, the best join
execution strategy is a hybrid hashjoin/hash block nested loop join.
To do this, preview whether increasing nbatches would move tuples, and, if
it would, increase nbatches (also employing slicing if the BufFile overhead
exceeds the threshold).
If increasing nbatches wouldn't move tuples, process this batch with
hash block nested loop join.
Essentially, what we want is logical units of tuples which are
work_mem-sized. In some cases, each unit may contain multiple batches
(a slice in Tomas' patch) and in other cases, each unit may contain
only part of a batch (a chunk is the term I used in my hash block
nested loop join patch [1]).
For slicing, each unit, a slice, has multiple batches but one spill
file.
For hbnlj, each unit, a chunk, is one of multiple chunks in a single
batch, all of which are in the same spill file (1 batch = 1 spill
file).
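To make the two kinds of unit concrete, here is an illustrative-only pair
of structures; neither matches the actual layout in either patch, and a
plain FILE pointer stands in for BufFile:

```c
#include <stdio.h>

/* slicing: several consecutive batches share one spill file per side */
typedef struct Slice
{
    int     firstbatch;     /* first batch number in this slice */
    int     nbatches;       /* how many batches are lumped together */
    FILE   *innerfile;      /* single overflow file for the inner side */
    FILE   *outerfile;      /* single overflow file for the outer side */
} Slice;

/* hash block nested loop join: one batch is processed in chunks */
typedef struct BatchChunks
{
    int     batchno;        /* the (oversized) batch being processed */
    int     nchunks;        /* number of work_mem-sized pieces */
    FILE   *spillfile;      /* 1 batch = 1 spill file, read chunk by chunk */
} BatchChunks;
```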
Thinking through it, it seems to make the most sense to split the work
into ~ 3 independent pieces:
patch1 - "preview" a batch increase (not yet written [I think])
patch2 - slicing (Tomas' patch [2] but add in a threshold for the portion of
work_mem the BufFile overhead is using)
patch3 - hash block nested loop join (my patch [1])
patch1 allows us to re-enable growth and was mentioned upthread, but I
will quote it here for simplicity:
I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" a nbatch increase. That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again. Rinse and repeat.
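A hypothetical sketch of such a "preview" pass, with the hash values passed
in as a plain array for simplicity (the real code would walk the hash table
chunks; all names here are illustrative, not from either patch):

```c
#include <stdbool.h>
#include <stdint.h>

static bool
nbatch_increase_would_help(const uint32_t *hashvalues, int ntuples,
                           int curbatch, int log2_nbuckets, int nbatch)
{
    int     moved = 0;
    int     new_nbatch = nbatch * 2;

    for (int i = 0; i < ntuples; i++)
    {
        /* same batch assignment rule as ExecHashGetBucketAndBatch */
        int     batchno = (hashvalues[i] >> log2_nbuckets) & (new_nbatch - 1);

        if (batchno != curbatch)
            moved++;
    }

    /* e.g. require at least 5% of tuples to move (cf. the "95% rule") */
    return moved * 20 >= ntuples;
}
```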
We don't want to fill up work_mem with buffile overhead after
increasing nbatches many times just to move a few tuples for one batch
and end up disabling growth thus making it so that later we can't
increase nbatches and repartition for a batch that would nicely
partition (like Hubert's case, I believe [3]).
We want to identify when re-partitioning would help and only do it
then and, for times when it wouldn't help, use a fallback strategy
that still allows progress on the hashjoin, and, for some spiky data,
where we have re-partitioned for the right reasons, but there are
still a lot of batches that are small enough that they could all fit
in memory at once, we want to track them with as little overhead as
possible -- lump them into slices.
We should probably consider deciding to use slices based on some
threshold for the portion of work_mem which is allowed to be occupied
by buffile overhead instead of waiting until the buffile overhead is
literally taking up most of work_mem.
The above summary is to address the concern in this thread about a
holistic solution.
I think the slicing patch is independent of both the hash block nested
loop join patch and the "preview" mode for batch increasing.
If slicing is made to work for parallel-aware hashjoin and the code is
in a committable state (and probably has the threshold I mentioned
above), then I think that this patch should go in.
[1]: /messages/by-id/CAAKRu_ZkkukQgXCK8ADe-PmvcmpZh6G1Uin8pqqovL4x7P30mQ@mail.gmail.com
[2]: /messages/by-id/20190508150844.rij36rtuk4lhvztw@development
[3]: /messages/by-id/CAB0yre=e8ysPyoUvZqjKYAxc6-VB=JKHL-7XKZSxy0FT5vY7BQ@mail.gmail.com
--
Melanie Plageman
On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
Okay, so, while I do have specific, actual code review/commitfest-y
feedback for the patch in this thread registered for this commitfest,
I wanted to defer that for a later email and use this one to cover off
on a few higher level issues.1) How this patch's approach fits into the wider set of problems with
hybrid hashjoin.2) Parallel HashJoin implementation of this patch's approach
I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.
OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe add
BNL on top.
I do think that accounting for Buffile overhead when estimating the
size of the hashtable during ExecChooseHashTableSize() so it can be
used during planning is a worthwhile patch by itself (though I know it
is not even part of this patch).
+1 to that
I'll start with 2 since I have less to say there.
From comments upthread, I take it this would not work with parallel
hashjoin as expected. Is this because each worker operates on batches
independently and now batches are lumped into slices?

Thinking through a parallel-aware implementation, it seems like you
would use slice-based barriers for the build phase but batch-based
barriers for the probe phase to avoid getting out of sync (workers
with outer tuples from one batch should not try and join those with
tuples from another batch, even if in the same slice).

You would, of course, need to add code to make slices work with
SharedTuplestore--caveat here is I still haven't tried to understand
how parallel-aware hashjoin works/uses SharedTuplestore.
I don't know. I haven't thought about the parallel version very much. I
wonder if Thomas Munro has some thoughts about it ...
Now, addressing 1, how this patch fits into the wider set of problems
with current hybrid hashjoin:

Thomas Munro nicely summarized roughly what I'm about to lay out like
this (upthread) -- he called them "three separate but related
problems":

A. Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys. We could fix that
with (say) a 95% rule.
B. Lack of good alternative execution strategy when the escape valve
is triggered. A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C. Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.

However, I would like to lay out the problem space a little bit
differently (using the end of the alphabet to differentiate).

The following scenarios are how you could end up running out of
memory:

Y. Plan-time underestimation of the number of required batches with
relatively uniform data distribution

In this case, the best join execution strategy is a plain hashjoin
with spilling as needed.
nbatches should be increased as needed, because the data is ~evenly
distributed.
slicing should be employed when the BufFile overhead exceeds some
threshold ratio of work_mem allowed for BufFile overhead
OK, makes sense. But at some point we get so many slices the overflow
files alone use more than work_mem. Of course, to hit that the
underestimate needs to be sufficiently serious. My understanding was we'll
roll until that point and then switch to BNL.
Z. Plan- and/or execution-time underestimation of the number of
required batches with skewed data

If you knew this at planning time, you could have picked another
join-type, though, there might be cases where it would actually be
less costly to use plain hashjoin for all batches except the bad batch
and fall back to hash block nested loop join just for the duplicate
values.

If you could not have known this at planning time, the best join
execution strategy is a hybrid hashjoin/hash block nested loop join.

To do this, preview whether increasing nbatches would move tuples, and, if
it would, increase nbatches (also employing slicing if the BufFile overhead
exceeds the threshold).

If increasing nbatches wouldn't move tuples, process this batch with
hash block nested loop join.
OK.
Essentially, what we want is logical units of tuples which are
work_mem-sized. In some cases, each unit may contain multiple batches
(a slice in Tomas' patch) and in other cases, each unit may contain
only part of a batch (a chunk is the term I used in my hash block
nested loop join patch [1]).
OK, although with slicing the work_mem-sized unit is still one batch. The
slice just ensures the metadata we need to keep in memory does not grow as
O(N) with the number of batches (instead it's O(log(N)) I think).
For slicing, each unit, a slice, has multiple batches but one spill
file.
For hbnlj, each unit, a chunk, is one of multiple chunks in a single
batch, all of which are in the same spill file (1 batch = 1 spill
file).
Yep.
Thinking through it, it seems to make the most sense to split the work
into ~ 3 independent pieces:

patch1 - "preview" a batch increase (not yet written [I think])
patch2 - slicing (Tomas' patch [2] but add in a threshold for the portion of
work_mem the BufFile overhead is using)
patch3 - hash block nested loop join (my patch [1])

patch1 allows us to re-enable growth and was mentioned upthread, but I
will quote it here for simplicity:

I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" a nbatch increase. That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again. Rinse and repeat.

We don't want to fill up work_mem with buffile overhead after
increasing nbatches many times just to move a few tuples for one batch
and end up disabling growth thus making it so that later we can't
increase nbatches and repartition for a batch that would nicely
partition (like Hubert's case, I believe [3]).
Yes, this seems like a very reasonable plan. Also, I now see it actually
explains what the plan with BNL vs. slicing is.
We want to identify when re-partitioning would help and only do it
then and, for times when it wouldn't help, use a fallback strategy
that still allows progress on the hashjoin, and, for some spiky data,
where we have re-partitioned for the right reasons, but there are
still a lot of batches that are small enough that they could all fit
in memory at once, we want to track them with as little overhead as
possible -- lump them into slices.

We should probably consider deciding to use slices based on some
threshold for the portion of work_mem which is allowed to be occupied
by buffile overhead instead of waiting until the buffile overhead is
literally taking up most of work_mem.
But that heuristics is already there, no? That's the "Don't use more than
2*work_mem/3 for batch BufFiles" at which point we start adding slices.
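A minimal sketch of that heuristic, assuming ~8kB per BufFile and two files
(inner + outer) per batch; the constant and names are illustrative rather
than copied from the patch:

```c
#include <stdbool.h>
#include <stddef.h>

#define BUFFILE_OVERHEAD 8192       /* ~8kB buffer inside each BufFile */

static bool
should_start_slicing(int nbatch, size_t work_mem_bytes)
{
    size_t  overhead = (size_t) nbatch * 2 * BUFFILE_OVERHEAD;

    /* don't let the batch BufFiles alone use more than 2/3 of work_mem */
    return overhead > work_mem_bytes * 2 / 3;
}
```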
The above summary is to address the concern in this thread about a
holistic solution.

I think the slicing patch is independent of both the hash block nested
loop join patch and the "preview" mode for batch increasing.

If slicing is made to work for parallel-aware hashjoin and the code is
in a committable state (and probably has the threshold I mentioned
above), then I think that this patch should go in.
Yes, I think this seems like a good plan.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, May 6, 2019 at 9:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Stepping back a bit, I think there is something fishy about the way we
detect extreme skew. Is that a factor in this case? Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory. Oops.

I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys fall into the batch containing
the key for the M duplicates before we give up! You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M. In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch. I'm not sure how you
pick the number.
Another thing that is fishy about this is that we can't split a batch
or a bucket without splitting them all. Let's say that nbatches *
nbuckets = 16 million. One bucket in one batch contains 90% of the
tuples. Splitting *that* bucket might be a good idea if only 5% of the
tuples end up moving, perhaps even if only 1% end up moving. But, if
you have to double the total number of batches to get that benefit,
it's a lot less compelling, because now you have to rescan the outer
side more times.
I wonder whether we should be dividing things into batches unevenly,
based on the distribution of tuples we've seen so far. For example,
suppose we've gotten to 1024 buckets and that's all we can fit in
memory. If we decide to go to 2 batches, we'll use the next bit of the
hash key to decide which things go into batch 0 and which things go
into batch 1. But if we know that 50% of the data is in bucket 17, why
are we not making bucket 17 into a batch and everything else into
another batch? Then, when we process the batch that was derived from
bucket-17, we can use 10 completely new bits from the hash key to
slice the data from that bucket as finely as possible.
Now the bucket might be entirely duplicates, in which case no number
of additional bits will help. However, even in that case it's still a
good idea to make it its own batch, and then use some other algorithm
to process that batch. And if it's *not* entirely duplicates, but
there are say 2 or 3 really common values that unluckily hash to the
same bucket, then being able to use a lot more bits for that portion
of the data gives us the best chance of managing to spread it out into
different buckets.
Similarly, if we split the hash join into four batches, and batch 0
fits in memory but batch 1 does not, we cannot further split batch 1
without splitting batch 2 and batch 3 also. That's not good either,
because those batches might be small and not need splitting.
I guess what I'm trying to say is that our algorithms for dealing with
mis-estimation seem to be largely oblivious to the problem of skew,
and I don't think the problem is confined to extreme skew. Suppose you
have some data that is only moderately skewed, so that when you build
a hash table with 1024 buckets, 25% of the data is in buckets 0-19,
25% in buckets 20-768, 25% in buckets 769-946, and the last 25% in
buckets 947-1023. If you knew that, then when you discover that the
data is 4x too large to fit in memory, you can divide the data into 4
batches using those bucket number ranges, and get it done in exactly 4
batches. As it is, you'll need to split until every uniform range of
buckets fits in memory: 0-31 is going to be too big a range, so you're
going to go with 0-15, which means you'll have 64 batches instead of
4.
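A rough sketch of that kind of grouping, purely to illustrate the idea (not
from any posted patch): given per-bucket byte counts observed while building
the hash table, greedily group contiguous bucket ranges so that each group
fits under the memory limit.

```c
#include <stddef.h>

static int
assign_buckets_to_batches(const size_t *bucket_bytes, int nbuckets,
                          size_t limit, int *batch_of_bucket)
{
    int     batchno = 0;
    size_t  used = 0;

    for (int b = 0; b < nbuckets; b++)
    {
        /* start a new batch once adding this bucket would exceed the limit */
        if (used > 0 && used + bucket_bytes[b] > limit)
        {
            batchno++;
            used = 0;
        }
        used += bucket_bytes[b];
        batch_of_bucket[b] = batchno;
    }

    return batchno + 1;     /* number of batches produced */
}
```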
It seems to me that a good chunk of what's being proposed right now
basically ignores the fact that we're not really responding to the
skew in a very effective way. Thomas wants to stop splitting all the
buckets when splitting one of the buckets produces only a very small
benefit rather than when it produces no benefit at all, but he's not
asking why we're splitting all of the buckets in the first place.
Tomas wants to slice the array of batches because there are so many of
them, but why are there so many? As he said himself, "it gets to that
many batches because some of the values are very common and we don't
disable the growth earlier." Realistically, I don't see how there can
be so many batches that we can't even fit the metadata about the
matches into memory unless we're unnecessarily creating a lot of
little tiny batches that we don't really need.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jul 12, 2019 at 1:16 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, May 6, 2019 at 9:49 PM Thomas Munro <thomas.munro@gmail.com>
wrote:

Stepping back a bit, I think there is something fishy about the way we
detect extreme skew. Is that a factor in this case? Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory. Oops.

I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys fall into the batch containing
the key for the M duplicates before we give up! You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M. In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch. I'm not sure how you
pick the number.

Another thing that is fishy about this is that we can't split a batch
or a bucket without splitting them all. Let's say that nbatches *
nbuckets = 16 million. One bucket in one batch contains 90% of the
tuples. Splitting *that* bucket might be a good idea if only 5% of the
tuples end up moving, perhaps even if only 1% end up moving. But, if
you have to double the total number of batches to get that benefit,
it's a lot less compelling, because now you have to rescan the outer
side more times.
It seems to me that a good chunk of what's being proposed right now
basically ignores the fact that we're not really responding to the
skew in a very effective way. Thomas wants to stop splitting all the
buckets when splitting one of the buckets produces only a very small
benefit rather than when it produces no benefit at all, but he's not
asking why we're splitting all of the buckets in the first place.
Tomas wants to slice the array of batches because there are so many of
them, but why are there so many? As he said himself, "it gets to that
many batches because some of the values are very common and we don't
disable the growth earlier." Realistically, I don't see how there can
be so many batches that we can't even fit the metadata about the
matches into memory unless we're unnecessarily creating a lot of
little tiny batches that we don't really need.
+1 on Robert's suggestion. It's worth finding the root cause of the batch
explosion problem.
As Robert pointed out, "we can't split a batch without splitting them all".
In fact, the hybrid hash join algorithm should only split the overflowing
batch and avoid splitting the small batches which could be processed in
memory. The planner should calculate an initial batch number which ensures
the average-size batch can be processed in memory given the data
distribution. The executor should split skewed batches in a
one-batch-at-a-time manner.
I will first show an example to help understand the batch explosion problem.
Suppose we are going to join R and S, and the planner calculates the initial
nbatch as 4.
In the first batch run, during the HJ_BUILD_HASHTABLE state we scan R, build
an in-memory hash table for batch1, and spill the other tuples of R into
separate batch files (R2-R4). During the HJ_NEED_NEW_OUTER and
HJ_SCAN_BUCKET states, we do two things: 1. if a tuple of S belongs to the
current batch, match it against the in-memory R1 and emit the result to the
parent plan node; 2. if a tuple of S doesn't belong to the current batch,
spill it to one of the batch files S2-S4. As a result, after the first batch
run we get:
6 disk files: batch2(R2,S2), batch3(R3,S3), batch4(R4,S4)
Now we run into the HJ_NEED_NEW_BATCH state and begin to process R2 and S2.
Suppose the second batch R2 is skewed and needs the batch number increased
to 8. While building the in-memory hash table for R2, we also split some
tuples of R2 into spill file R6 (based on our hash function, tuples
belonging to R2 cannot be shuffled to any batch other than R6). After R2's
hash table is built, we begin to probe the tuples in S2. Since the batch
number has changed from 4 to 8, some tuples of S2 now belong to S6 and we
spill them to disk file S6. The other tuples of S2 are matched against R2
and the results are emitted to the parent plan node. After the second batch
is processed, we have:
disk files: batch3(R3,S3), batch4(R4,S4), batch6(R6,S6)
Next, we begin to process R3 and S3. The third batch R3 is not skewed, but
our hash function depends on the batch number, which is now 8. So we have
to split some tuples of R3 into disk file R7, *which is not necessary*. When
probing S3, we also need to split some tuples of S3 into S7, *which is not
necessary either*. Since R3 could be loaded into memory entirely, spilling
part of R3 to a disk file not only introduces more files and file buffers
(which is the problem Tomas is trying to solve), but also slows down the
performance. After the third batch is processed, we have:
disk files: batch4(R4,S4), batch6(R6,S6), batch7(R7,S7)
Next, we begin to process R4 and S4. Similar to R3, some tuples of R4 also
need to be spilled to file R8. But after this splitting, suppose R4 is
still skewed, and we increase the batch number again to 16. As a result,
some tuples of R4 will be spilled to files R12 and R16. When probing S4,
we similarly need to split some tuples of S4 into S8, S12 and S16. After
this step, we have:
disk files:
batch6(R6,S6), batch7(R7,S7), batch8(R8,S8), batch12(R12,S12), batch16(R16,S16)
Next, when we begin to process R6 and S6, even though we could build the
hash table for R6 entirely in memory, we still have to split R6 based on the
new batch number 16 and spill tuples to file R14. *It's not necessary.*
Now we can conclude that increasing the batch number introduces unnecessary
repeated spilling, not only for original batches (R3,S3) but also for newly
generated batches (R6,S6), in a cascading way. *In a worse case, suppose R2
is extremely skewed and needs to be split 10 times, while R3 could build its
hash table entirely in memory. In this case, we have to introduce
R7,R11,....,R4095, in total 1023 unnecessary spill files. Each of these
files may contain fewer than ten tuples. Also, we need to palloc a file
buffer (512KB) for these spill files. This is the so-called batch explosion
problem.*
*Solutions:*
To avoid this unnecessary repeated spilling, I propose to turn the function
ExecHashGetBucketAndBatch
into a hash function chain to determine the batchno.
Here is the original implementation of ExecHashGetBucketAndBatch
```
//nbatch is the global batch number
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
```
We can see that the original hash function basically calculates the hash
value MOD the global batch number.
A real hybrid hash join should use a hash function chain to determine the
batchno. In the new algorithm, the components of the hash function chain are
defined as: MOD of #IBN (the initial batch number), MOD of #IBN*2, MOD of
#IBN*4, MOD of #IBN*8, etc. A small batch just uses the first hash function
in the chain, while a skewed batch uses as many hash functions in the chain
as the number of times it has been split.
Here is the new implementation of ExecHashGetBucketAndBatch
```
/* i is the current batchno we are processing */
/* hashChainLen[i] records how many times batch i has been split */
for (j = 0; j < hashChainLen[i]; j++)
{
	/* initial_nbatch << j is the initial batch number (#IBN) times 2^j */
	batchno = (hashvalue >> hashtable->log2_nbuckets) &
			  ((initial_nbatch << j) - 1);

	/*
	 * If the calculated batchno is still i, we need to try the next hash
	 * function in the chain to determine the final batchno; otherwise we
	 * can return it directly.
	 */
	if (batchno != i)
		return batchno;
}
return batchno;
```
A quick example. Suppose R3's input is 3,7,11,15,19,23,27,31,35,15,27 (we
can ensure they are all MOD 4 = 3).
Suppose the initial batch number is 4 and memory can hold 4 tuples, so the
5th tuple forces a batch split.
Step 1: batch3 processes 3,7,11,15,19 and now needs to split;
chainLen[3]=2
batch3: 3,11,19
batch7: 7,15
Step2: 23,27,31 coming
batch3: 3,11,19,27
batch7: 7,15,23,31
Step 3: 35 coming, batch3 needs to split again
chainLen[3]=3
batch3: 3,19,35
batch7: 7,15,23,31
batch11: 11,27
Step 4: 15 coming, HashFun1 15%4=3, HashFun2 15%8=7;
since 7!=3 spill 15 to batch7.
Step 5: 27 coming, 27%4=3, 27%8=3, 27%16=11;
since 11!=3 spill 27 to batch 11.
Final state:
chainLen[3]=3
batch3: 3,19,35
batch7: 7,15,23,31,15
batch11: 11,27,27
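A self-contained sketch of the chain lookup, reproducing steps 4 and 5 above
(names are illustrative, and the real function would also shift by
log2_nbuckets before masking):

```c
#include <stdio.h>

static int
chain_batchno(unsigned hashvalue, int initial_nbatch, int chainlen, int owner)
{
    int     batchno = owner;

    for (int j = 0; j < chainlen; j++)
    {
        /* MOD of initial_nbatch * 2^j, done as a mask (powers of two) */
        batchno = hashvalue & (initial_nbatch * (1 << j) - 1);
        if (batchno != owner)
            return batchno;
    }
    return batchno;
}

int
main(void)
{
    /* chainLen[3] == 3 after the two splits in the example */
    printf("15 -> batch %d\n", chain_batchno(15, 4, 3, 3));     /* prints 7 */
    printf("27 -> batch %d\n", chain_batchno(27, 4, 3, 3));     /* prints 11 */
    return 0;
}
```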
Here is pseudocode for processing batch i:
```
/* Step 1: build the hash table for batch i from file Ri */
for each tuple in ReadFromFile(Ri)
{
	/* get the batchno using the new hash-function-chain lookup */
	batchno = NewExecHashGetBucketAndBatch();

	/* spill the tuple if it no longer belongs to the current batch */
	if (batchno != i)
		spill tuple to file[batchno];
	else
	{
		flag = InsertTupleToHashTable(HT, tuple);
		if (flag == NEED_SPLIT)
		{
			hashChainLen[i]++;
			/* then call ExecHashIncreaseNumBatches() to do the real spill */
		}
	}
}

/* Step 2: probe stage, reading the outer files that map to batch i */
for each tuple in ReadFromFile(S[i + Bi*k])
{
	batchno = NewExecHashGetBucketAndBatch();
	if (batchno == curbatch)
		probe and match;
	else
		spillToFile(tuple, batchno);
}
```
This solution only splits the batches which need to be split, in a lazy way.
If this solution makes sense, I would like to write the real patch.
Any comments?
--
Thanks
Hubert Zhang
On 2019-Jul-11, Tomas Vondra wrote:
On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.

OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe add
BNL on top.
So what's a good way forward for this patch? Stalling forever like a
glacier is not an option; it'll probably end up melting. There's a lot
of discussion on this thread which I haven't read, and it's not
immediately clear to me whether this patch should just be thrown away in
favor of something completely different, or it can be considered a first
step in a long road.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 3, 2019 at 9:36 AM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
On 2019-Jul-11, Tomas Vondra wrote:
On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.

OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe add
BNL on top.

So what's a good way forward for this patch? Stalling forever like a
glacier is not an option; it'll probably end up melting. There's a lot
of discussion on this thread which I haven't read, and it's not
immediately clear to me whether this patch should just be thrown away in
favor of something completely different, or it can be considered a first
step in a long road.
So, I have been working on the fallback to block nested loop join
patch--latest non-parallel version posted here [1]. I am currently
still working on the parallel version but don't have a complete
working patch yet. I am hoping to finish it and solicit feedback in
the next couple weeks.
My patch chunks up a bad inner side batch and processes it a chunk
at a time. I haven't spent too much time yet thinking about Hubert's
suggestion proposed upthread. In the past I had asked Tomas about the
idea of splitting up only "bad batches" to avoid having other batches
which are very small. It seemed like this introduced additional
complexity for future spilled tuples finding a home, however, I had
not considered the hash function chain method Hubert is mentioning.
Even if we implemented additional strategies like the one Hubert is
suggesting, I still think that both the slicing patch originally
proposed in this thread as well as a BNLJ fallback option could all
work together, as I believe they solve slightly different problems.
If Tomas or someone else has time to pick up and modify the BufFile
accounting patch, committing that still seems like the next logical
step.
I will work on getting a complete (parallel-aware) BNLJ patch posted
soon.
[1]: /messages/by-id/CAAKRu_ZsRU+nszShs3AGVorx=e+2jYkL7X=jiNO6+qbho7vRpw@mail.gmail.com
--
Melanie Plageman
On Thu, Sep 05, 2019 at 09:54:33AM -0700, Melanie Plageman wrote:
On Tue, Sep 3, 2019 at 9:36 AM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:On 2019-Jul-11, Tomas Vondra wrote:
On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.

OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe add
BNL on top.

So what's a good way forward for this patch? Stalling forever like a
glacier is not an option; it'll probably end up melting. There's a lot
of discussion on this thread which I haven't read, and it's not
immediately clear to me whether this patch should just be thrown away in
favor of something completely different, or it can be considered a first
step in a long road.

So, I have been working on the fallback to block nested loop join
patch--latest non-parallel version posted here [1]. I am currently
still working on the parallel version but don't have a complete
working patch yet. I am hoping to finish it and solicit feedback in
the next couple weeks.

My patch chunks up a bad inner side batch and processes it a chunk
at a time. I haven't spent too much time yet thinking about Hubert's
suggestion proposed upthread. In the past I had asked Tomas about the
idea of splitting up only "bad batches" to avoid having other batches
which are very small. It seemed like this introduced additional
complexity for future spilled tuples finding a home, however, I had
not considered the hash function chain method Hubert is mentioning.

Even if we implemented additional strategies like the one Hubert is
suggesting, I still think that both the slicing patch originally
proposed in this thread as well as a BNLJ fallback option could all
work together, as I believe they solve slightly different problems.
I have to admit I kinda lost track of how exactly all the HJ patches
posted in various -hackers threads shall work together in the end. We have
far too many in-flight patches dealing with this part of the code at the
moment. It's a bit like with the buses - for years there were no patches
fixing those issues, and now we have 17 ;-)
My feeling is that we should get the BNLJ committed first, and then maybe
use some of those additional strategies as fallbacks (depending on which
issues are still unsolved by the BNLJ).
If Tomas or someone else has time to pick up and modify the BufFile
accounting patch, committing that still seems like the next logical
step.
OK, I'll look into that (i.e. considering BufFile memory during planning,
and disabling HJ if not possible).
I will work on getting a complete (parallel-aware) BNLJ patch posted
soon.
Good!
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote:
My feeling is that we should get the BNLJ committed first, and then maybe
use some of those additional strategies as fallbacks (depending on which
issues are still unsolved by the BNLJ).
The glacier is melting more. Tomas, what's your status here? The
patch has been waiting on author for two months now. If you are not
planning to work more on this one, then it should be marked as
returned with feedback?
--
Michael
On Mon, Nov 25, 2019 at 05:33:35PM +0900, Michael Paquier wrote:
On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote:
My feeling is that we should get the BNLJ committed first, and then maybe
use some of those additional strategies as fallbacks (depending on which
issues are still unsolved by the BNLJ).

The glacier is melting more. Tomas, what's your status here? The
patch has been waiting on author for two months now. If you are not
planning to work more on this one, then it should be marked as
returned with feedback?
I'm not planning to do any immediate work on this, so I agree with
marking it as RWF. I think Melanie is working on the BNL patch, which
seems like the right solution.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 25, 2019 at 07:11:19PM +0100, Tomas Vondra wrote:
I'm not planning to do any immediate work on this, so I agree with
marking it as RWF. I think Melanie is working on the BNL patch, which
seems like the right solution.
Thanks, I have switched the patch as returned with feedback.
--
Michael
On Mon, Nov 25, 2019 at 10:11 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On Mon, Nov 25, 2019 at 05:33:35PM +0900, Michael Paquier wrote:
On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote:
My feeling is that we should get the BNLJ committed first, and then
maybe
use some of those additional strategies as fallbacks (depending on which
issues are still unsolved by the BNLJ).The glacier is melting more. Tomas, what's your status here? The
patch has been waiting on author for two months now. If you are not
planning to work more on this one, then it should be marked as
returned with feedback?

I'm not planning to do any immediate work on this, so I agree with
marking it as RWF. I think Melanie is working on the BNL patch, which
seems like the right solution.
Sorry for the delay. I have posted the parallel-aware version of the BNLJ
(adaptive HJ) in the thread which originally had all of the
patches for it [1]. It's not near committable, so I wasn't going to
register it for a commitfest yet, but I would love feedback on the
prototype.
[1]: /messages/by-id/CAAKRu_YsWm7gc_b2nBGWFPE6wuhdOLfc1LBZ786DUzaCPUDXCA@mail.gmail.com
--
Melanie Plageman