Avoiding OOM in a hash join with many duplicate inner keys
The planner doesn't currently worry about work_mem restrictions when
planning a hash join, figuring that the executor should be able to
subdivide the data arbitrarily finely by splitting buckets at runtime.
However there's a thread here:
/messages/by-id/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1+A0PRx1BGg@mail.gmail.com
exhibiting a case where a hash join was chosen even though a single
value accounts for three-quarters of the inner relation. Bucket
splitting obviously can never separate multiple instances of the
same value, so this choice forced the executor to try to load
three-quarters of the (very large) inner relation into memory at once;
unsurprisingly, it failed.
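(As a quick illustration of why bucket splitting can't help: identical join
keys always produce identical hash values, so every duplicate of a key lands
in the same bucket no matter how finely we split. Using the built-in int4
hash support function, for example:

select hashint4(1) from generate_series(1,3);

returns the same hash three times.)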
To fix this, I think we need to discourage use of hash joins whenever
a single bucket is predicted to exceed work_mem, as in the attached
draft patch. The patch results in changing from hash to merge join
in one regression test case, which is fine; that case only cares about
the join order not the types of the joins.
This might be overly aggressive, because it will pretty much shut off
any attempt to use hash joining on a large inner relation unless we
have statistics for it (and those stats are favorable). But having
seen this example, I think we need to be worried.
I'm inclined to treat this as a bug and back-patch it, but I wonder if
anyone is concerned about possible plan destabilization in the back
branches.
regards, tom lane
On Thu, Feb 16, 2017 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The planner doesn't currently worry about work_mem restrictions when
planning a hash join, figuring that the executor should be able to
subdivide the data arbitrarily finely by splitting buckets at runtime.
However there's a thread here:
/messages/by-id/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1+A0PRx1BGg@mail.gmail.com
exhibiting a case where a hash join was chosen even though a single
value accounts for three-quarters of the inner relation. Bucket
splitting obviously can never separate multiple instances of the
same value, so this choice forced the executor to try to load
three-quarters of the (very large) inner relation into memory at once;
unsurprisingly, it failed.
To fix this, I think we need to discourage use of hash joins whenever
a single bucket is predicted to exceed work_mem, as in the attached
draft patch. The patch results in changing from hash to merge join
in one regression test case, which is fine; that case only cares about
the join order not the types of the joins.
This might be overly aggressive, because it will pretty much shut off
any attempt to use hash joining on a large inner relation unless we
have statistics for it (and those stats are favorable). But having
seen this example, I think we need to be worried.
I do think that's worrying, but on the other hand it seems like this
solution could disable many hash joins that would actually be fine. I
don't think the largest ndistinct estimates we ever generate are very
large, and therefore this seems highly prone to worry even when
worrying isn't really justified.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 16, 2017 at 11:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I do think that's worrying, but on the other hand it seems like this
solution could disable many hash joins that would actually be fine. I
don't think the largest ndistinct estimates we ever generate are very
large, and therefore this seems highly prone to worry even when
worrying isn't really justified.
+1. ndistinct has a general tendency to be wrong, owing to how ANALYZE
works, which we see problems with from time to time.
--
Peter Geoghegan
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Feb 16, 2017 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This might be overly aggressive, because it will pretty much shut off
any attempt to use hash joining on a large inner relation unless we
have statistics for it (and those stats are favorable). But having
seen this example, I think we need to be worried.
I do think that's worrying, but on the other hand it seems like this
solution could disable many hash joins that would actually be fine. I
don't think the largest ndistinct estimates we ever generate are very
large, and therefore this seems highly prone to worry even when
worrying isn't really justified.
I initially thought about driving the shutoff strictly from the estimate
of the MCV frequency, without involving the more general ndistinct
computation that estimate_hash_bucketsize does. I'm not sure how much
that would do for your concern, but at least the MCV frequency doesn't
involve quite as much extrapolation as ndistinct.
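(Both numbers are visible in pg_stats, for whatever that's worth; e.g., for
some inner join column -- the table and column names here are just
placeholders --

select n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'inner_tab' and attname = 'k';

shows the ndistinct estimate next to the MCV list and frequencies that
ANALYZE actually collected.)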
There's a practical problem from final_cost_hashjoin's standpoint,
which is that it has noplace to cache the MCV frequency separately from
estimate_hash_bucketsize's output. In HEAD we could just add some more
fields to RestrictInfo, but that would be an unacceptable ABI break in
the back branches. Maybe we could get away with replacing the float8
bucketsize fields with two float4 fields --- it seems unlikely that we
need more than 6 digits of precision for these numbers, and I doubt any
extensions are touching the bucketsize fields.
regards, tom lane
On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Feb 16, 2017 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This might be overly aggressive, because it will pretty much shut off
any attempt to use hash joining on a large inner relation unless we
have statistics for it (and those stats are favorable). But having
seen this example, I think we need to be worried.
I do think that's worrying, but on the other hand it seems like this
solution could disable many hash joins that would actually be fine. I
don't think the largest ndistinct estimates we ever generate are very
large, and therefore this seems highly prone to worry even when
worrying isn't really justified.
I initially thought about driving the shutoff strictly from the estimate
of the MCV frequency, without involving the more general ndistinct
computation that estimate_hash_bucketsize does. I'm not sure how much
that would do for your concern, but at least the MCV frequency doesn't
involve quite as much extrapolation as ndistinct.
Hmm, so we could do something like: if the estimated frequency of the
least-common MCV is enough to make one bucket overflow work_mem, then
don't use a hash join? That would still be prone to some error (in
both directions, really) but it seems less likely to spit out
completely stupid results than relying on ndistinct, which never gets
very big even in a 10TB table.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I initially thought about driving the shutoff strictly from the estimate
of the MCV frequency, without involving the more general ndistinct
computation that estimate_hash_bucketsize does. I'm not sure how much
that would do for your concern, but at least the MCV frequency doesn't
involve quite as much extrapolation as ndistinct.
Hmm, so we could do something like: if the estimated frequency of the
least-common MCV is enough to make one bucket overflow work_mem, then
don't use a hash join? That would still be prone to some error (in
both directions, really) but it seems less likely to spit out
completely stupid results than relying on ndistinct, which never gets
very big even in a 10TB table.
No, it'd be the *most* common MCV, because we're concerned about the
worst-case (largest) bucket size. But that's good, really, because the
highest MCV frequency will be the one we have most statistical
confidence in. There's generally a whole lot of noise in the tail-end
MCV numbers.
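(Roughly, the check being sketched amounts to comparing the largest expected
bucket against work_mem; a hand-rolled approximation, with placeholder
table/column names and a guessed per-tuple overhead constant, would be
something like

select s.most_common_freqs[1] * c.reltuples * (s.avg_width + 24)
       as est_largest_bucket_bytes
from pg_stats s join pg_class c on c.relname = s.tablename
where s.tablename = 'inner_tab' and s.attname = 'k';

i.e. highest MCV frequency times inner row count times approximate stored
tuple size; most_common_freqs is ordered by decreasing frequency, so [1] is
the value we'd care about.)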
Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
stats. That would be an expected case if the column is believed unique,
and it's probably a better fallback behavior when we simply don't have
stats. With the ndistinct-based rule, we'd be shutting off hashjoin
almost always when we don't have stats. Given how long it took us
to recognize this problem, that's probably the wrong default.
regards, tom lane
On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I initially thought about driving the shutoff strictly from the estimate
of the MCV frequency, without involving the more general ndistinct
computation that estimate_hash_bucketsize does. I'm not sure how much
that would do for your concern, but at least the MCV frequency doesn't
involve quite as much extrapolation as ndistinct.
Hmm, so we could do something like: if the estimated frequency of the
least-common MCV is enough to make one bucket overflow work_mem, then
don't use a hash join? That would still be prone to some error (in
both directions, really) but it seems less likely to spit out
completely stupid results than relying on ndistinct, which never gets
very big even in a 10TB table.
No, it'd be the *most* common MCV, because we're concerned about the
worst-case (largest) bucket size. But that's good, really, because the
highest MCV frequency will be the one we have most statistical
confidence in. There's generally a whole lot of noise in the tail-end
MCV numbers.
Oh, right. That's reassuring, as it seems like it has a much better
chance of actually being right.
Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
stats. That would be an expected case if the column is believed unique,
and it's probably a better fallback behavior when we simply don't have
stats. With the ndistinct-based rule, we'd be shutting off hashjoin
almost always when we don't have stats. Given how long it took us
to recognize this problem, that's probably the wrong default.
Right.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
No, it'd be the *most* common MCV, because we're concerned about the
worst-case (largest) bucket size. But that's good, really, because the
highest MCV frequency will be the one we have most statistical
confidence in. There's generally a whole lot of noise in the tail-end
MCV numbers.
Oh, right. That's reassuring, as it seems like it has a much better
chance of actually being right.
Here's a version that does it that way. Unsurprisingly, it doesn't
cause any regression test changes, but you can confirm it's having an
effect with this test case:
create table tt(f1 int);
insert into tt select 1 from generate_series(1,1000000) g;
insert into tt select g from generate_series(1,1000000) g;
analyze tt;
explain select * from tt a natural join tt b;
Unpatched code will go for a hash join on this example.
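(For comparison, the plan the planner retreats to when hash join is taken
off the table can be previewed on unpatched code by disabling hash joins for
the session:

set enable_hashjoin = off;
explain select * from tt a natural join tt b;
reset enable_hashjoin;

which is a reasonable proxy for the alternative the cost comparison falls
back to, though not necessarily identical to what the patch picks.)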
For application to the back branches, we could do it just like this
(leaving the existing fields alone, and allowing sizeof(RestrictInfo)
to grow), or we could change the datatypes of the four fields involved
to float4 so that sizeof(RestrictInfo) stays the same. I'm not entirely
sure which way is the more hazardous from an ABI standpoint. If there
are any external modules doing their own palloc(sizeof(RestrictInfo))
calls, the first way would be bad, but really there shouldn't be; I'd
expect people to be using make_restrictinfo() and friends. (Note that
palloc's power-of-2 padding wouldn't save us, because sizeof(RestrictInfo)
is currently exactly 128 on 32-bit machines in several of the back
branches.) Conversely, if any non-core code is touching the bucketsize
fields, changing those field widths would break that; but that doesn't
seem all that likely either. On balance I think I might go for the first
way, because it would remove doubt about whether reducing the precision
of the bucketsize estimates would cause any unexpected plan changes.
Or we could decide not to back-patch because the problem doesn't come
up often enough to justify taking any risk for. But given that we've
gotten one confirmed field report, I'm not voting that way.
regards, tom lane
On Fri, Feb 17, 2017 at 11:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
No, it'd be the *most* common MCV, because we're concerned about the
worst-case (largest) bucket size. But that's good, really, because the
highest MCV frequency will be the one we have most statistical
confidence in. There's generally a whole lot of noise in the tail-end
MCV numbers.
Oh, right. That's reassuring, as it seems like it has a much better
chance of actually being right.
Here's a version that does it that way. Unsurprisingly, it doesn't
cause any regression test changes, but you can confirm it's having an
effect with this test case:
create table tt(f1 int);
insert into tt select 1 from generate_series(1,1000000) g;
insert into tt select g from generate_series(1,1000000) g;
analyze tt;
explain select * from tt a natural join tt b;
Unpatched code will go for a hash join on this example.
+1
By strange coincidence, I was about to propose something along these
lines on theoretical grounds, having spent a bunch of time studying
the hash join code recently. It makes a lot of sense to use
statistics to try to avoid the "fail" (ie fail to respect work_mem,
and maybe fail to complete: maybe better called "overflow" or
"explode") mode during planning.
I have been wondering about a couple of different worst case execution
strategies that would be better than throwing our hands up and
potentially exploding memory once we detect that further partitioning
is not going to help, if we still manage to reach that case despite
adding stats-based defences like this due to statistics being missing,
bad or confused by joins below us.
1. We could probe the fraction of the hash table that we have managed
to load into work_mem so far and then rewind the outer batch and do it
again for the next work_mem-sized fraction of the inner batch and so
on. For outer joins we'd need to scan for unmatched tuples after each
hash table refill. If we detect this condition during the initial
hash build (as opposed to a later inner batch->hash table load), we'd
need to disable the so called 'hybrid' optimisation and fall back to
the so called 'Grace' hash join; that is, we'd need to pull in the
whole outer relation and write it out to batches before we even begin
probing batch 0, so that we have the ability to rewind outer batch 0
for another pass. When doing extra passes of an outer batch file, we
have to make sure that we don't do the 'send this tuple to a future
batch' behaviour if the number of batches happens to have increased.
Modulo some details, and I may be missing something fundamental here
(maybe breaks in some semi/anti case?)...
2. We could just abandon hash join for this batch. "She cannae take
it, captain", so sort inner and outer batches and merge-join them
instead. Same comment re switching to Grace hash join if this happens
while loading inner batch 0; we'll need a complete inner batch 0 and
outer batch 0, so we can't use the hybrid optimisation.
Obviously there are vanishing returns here as we add more defences
making it increasingly unlikely that we hit "fail" mode. But it
bothers me that hash joins in general are not 100% guaranteed to be
able to complete unless you have infinite RAM.
--
Thomas Munro
http://www.enterprisedb.com
On Thu, Feb 16, 2017 at 8:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Obviously there are vanishing returns here as we add more defences
making it increasingly unlikely that we hit "fail" mode. But it
bothers me that hash joins in general are not 100% guaranteed to be
able to complete unless you have infinite RAM.
I think in practice most people are forced to set work_mem to such a
small percentage of their available RAM that actual RAM exhaustion is
quite rare. The default value of 4MB is probably conservative even
for a Raspberry Pi, at least until the connection count spikes
unexpectedly, or until you have this problem:
/messages/by-id/16161.1324414006@sss.pgh.pa.us
Most advice that I've seen for work_mem involves choosing values like
RAM / (4 * max_connections), which tends to result in much larger
values that are typically still very small compared to the
amount of RAM that's available at any given moment, because most of
the time you either don't have the maximum number of connections or
some of them are idle or not all of them are using plans that need any
work_mem. Unfortunately, even with these very conservative settings,
one sometimes sees a machine get absolutely swamped by a large
activity spike at a time when all of the backends just so happen to be
running a query that uses 2 or 3 (or 180) copies of work_mem.[1]
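(To make that rule of thumb concrete with made-up numbers: a 64GB box with
max_connections = 100 works out to 64GB / (4 * 100), i.e. roughly 160MB,

set work_mem = '160MB';  -- session setting; a server-wide default would go in postgresql.conf

which is indeed a small slice of RAM unless many backends simultaneously run
plans that each want several copies of it.)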
If I were going to try to do something about the problem of machines
running out of memory, I'd be inclined to look at the problem more
broadly than "hey, hash joins can exceed work_mem if certain bad
things happen" and instead think about "hey, work_mem is a stupid way
of deciding on a memory budget". The intrinsic stupidity of work_mem
as an allocation system means that (1) it's perfectly possible to run
out of memory even if every node respects the memory budget and (2)
it's perfectly possible to drastically underutilize the memory you do
have even if some nodes fail to respect the memory budget. Of course,
if we had a smarter system for deciding on the budget it would be more
not less important for nodes to respect the budget they were given, so
that's not really an argument against trying to plug the hole you're
complaining about here, just a doubt about how much it will improve
the user experience if that's the only thing you do.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[1] Or all of the connections just touch each of your 100,000
relations and the backend-local caches fill up and the whole system
falls over without using any work_mem at all.
Thomas Munro <thomas.munro@enterprisedb.com> writes:
I have been wondering about a couple of different worst case execution
strategies that would be better than throwing our hands up and
potentially exploding memory once we detect that further partitioning
is not going to help, if we still manage to reach that case despite
adding stats-based defences like this due to statistics being missing,
bad or confused by joins below us.
Yeah, it would definitely be nice if we could constrain the runtime
space consumption better.
1. We could probe the fraction of the hash table that we have managed
to load into work_mem so far and then rewind the outer batch and do it
again for the next work_mem-sized fraction of the inner batch and so
on. For outer joins we'd need to scan for unmatched tuples after each
hash table refill.
I do not understand how that works for a left join? You'd need to track
whether a given outer tuple has been matched in any one of the fractions
of the inner batch, so that when you're done with the batch you could know
which outer tuples need to be emitted null-extended. Right now we only
need to track that state for the current outer tuple, but in a rescan
situation we'd have to remember it for each outer tuple in the batch.
Perhaps it could be done by treating the outer batch file as read/write
and incorporating a state flag in each tuple; or to reduce write volume,
maintaining a separate outer batch file parallel to the main one with just
a bool or even just a bit per outer tuple. Seems messy though.
regards, tom lane
On Wed, Mar 8, 2017 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
I have been wondering about a couple of different worst case execution
strategies that would be better than throwing our hands up and
potentially exploding memory once we detect that further partitioning
is not going to help, if we still manage to reach that case despite
adding stats-based defences like this due to statistics being missing,
bad or confused by joins below us.
Yeah, it would definitely be nice if we could constrain the runtime
space consumption better.
1. We could probe the fraction of the hash table that we have managed
to load into work_mem so far and then rewind the outer batch and do it
again for the next work_mem-sized fraction of the inner batch and so
on. For outer joins we'd need to scan for unmatched tuples after each
hash table refill.
I do not understand how that works for a left join? You'd need to track
whether a given outer tuple has been matched in any one of the fractions
of the inner batch, so that when you're done with the batch you could know
which outer tuples need to be emitted null-extended. Right now we only
need to track that state for the current outer tuple, but in a rescan
situation we'd have to remember it for each outer tuple in the batch.
Perhaps it could be done by treating the outer batch file as read/write
and incorporating a state flag in each tuple; or to reduce write volume,
maintaining a separate outer batch file parallel to the main one with just
a bool or even just a bit per outer tuple. Seems messy though.
Right. Messy. I think what I described may fall under the category
of "block nested loop". It looks doable but not very appealing for
left joins, and performance seems not great, multiplying the probing
scans by the number of fragments. Whether we actually care about
performance at all when we've reached this emergency state and are
primarily concerned with completing the query I'm not entirely sure.
Another idea would be to identify the offending bucket (how?) and
spill it to disk in its own file, and track it by pushing a special
control object with a distinguishing header flag into the hash table
(or a new overflow table, or extend the duties of the skew table,
or...). We'd have to deal with the matched flags of spilled inner
tuples for right/full joins. Matching is really per-key, not
per-tuple, so if there is a control object in memory for each of these
"overflow" buckets then perhaps it could hold the matched flag that
covers all tuples with each distinct key. What I like about this is
that it doesn't change the join algorithm at all, it just bolts on a
per-bucket escape valve. The changes might be quite localised, though
I know someone who probably wouldn't like an extra branch in
ExecScanHashBucket().
--
Thomas Munro
http://www.enterprisedb.com