hash join hashtable size and work_mem

Started by Timothy J. Kordas, 5 messages
#1 Timothy J. Kordas
tkordas@greenplum.com

in nodeHash.c, the function ExecChooseHashTableSize() uses two different
methods for determining the number of buckets to use.

the current code looks something like:

    if (ntuples * tupsize > work_mem * 1024)
        buckets = (work_mem * 1024) / (tupsize * 10);
    else
        buckets = ntuples / 10;

So in the case where a spill is expected, we use work_mem to decide on the
hash-table size. In the case where a spill isn't expected, we rely on the
row estimate alone -- and make no provision for speeding up the join by
using the memory that we're allowed to use.

When profiling large hash-joins, it is often the case that scanning the
hash buckets is a bottleneck; it would be nice for the user to be able to
"throw memory" at a join to improve performance.

Am I missing something about the current implementation? I would expect
the bucket count to be calculated something like:

    buckets = (work_mem * 1024L) / (tupsize * NTUP_PER_BUCKET);

for both cases.
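
To make that concrete, here is a standalone sketch of the calculation I have
in mind (the helper name and the clamping are mine, not the actual
ExecChooseHashTableSize code):

    #include <stdint.h>

    #define NTUP_PER_BUCKET 10      /* target tuples per bucket, as in nodeHash.c */

    /*
     * Hypothetical helper: size the bucket array from work_mem (in kB)
     * and the estimated tuple width, for both the spill and the
     * no-spill case.
     */
    static int
    choose_nbuckets(int work_mem_kb, int tupsize)
    {
        int64_t nbuckets = ((int64_t) work_mem_kb * 1024L) /
                           ((int64_t) tupsize * NTUP_PER_BUCKET);

        if (nbuckets < 1)
            nbuckets = 1;
        if (nbuckets > INT32_MAX)
            nbuckets = INT32_MAX;
        return (int) nbuckets;
    }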

Making this change appears to improve hash-join performance substantially in
some cases, and as far as I can tell it doesn't hurt anything (apart from
using memory that the join is "allowed" to use given a particular work_mem
setting).

-Tim
--
tkordas@greenplum.com

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Timothy J. Kordas (#1)
Re: hash join hashtable size and work_mem

"Timothy J. Kordas" <tkordas@greenplum.com> writes:

> Am I missing something about the current implementation?

If the planner has correctly predicted the number of rows, the table
loading should be about NTUP_PER_BUCKET in either regime. Are you
sure you aren't just wishing that NTUP_PER_BUCKET were smaller?
I don't see that making the hashtable much larger than ntuples
is a good idea --- that just spreads out the live entries over more
cache lines, resulting in more cache thrashing.
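
To put rough numbers on it (assuming the default NTUP_PER_BUCKET of 10):
with a correctly estimated 1,000,000-tuple inner relation, the no-spill
branch picks about 100,000 buckets, giving an average chain of about 10
tuples per probe; the spill branch aims at the same loading, just capped
by work_mem.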

regards, tom lane

#3 Timothy J. Kordas
tkordas@greenplum.com
In reply to: Tom Lane (#2)
Re: hash join hashtable size and work_mem

Tom Lane wrote:

> If the planner has correctly predicted the number of rows, the table
> loading should be about NTUP_PER_BUCKET in either regime. Are you
> sure you aren't just wishing that NTUP_PER_BUCKET were smaller?

Maybe I wish NTUP_PER_BUCKET was smaller. But I don't think that's the whole
story.

The planner estimates definitely play a role in my concern here. For
mis-estimated inner relations, the current calculation may over-subscribe
the hash-table even though more work_mem is available (that is, there are
too many hash collisions *and* memory isn't being used to the fullest
extent allowed).

I've been tracking the number of tuples which land in each bucket, and I'd
like to see that number go down as I increase work_mem.
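
(The instrumentation is nothing fancy -- just a chain-length scan over the
in-memory bucket array. A self-contained sketch, using a stand-in struct
rather than the real hashtable types:)

    #include <stdio.h>

    /* Stand-in singly-linked tuple chain; not the real HashJoinTuple. */
    typedef struct ChainNode
    {
        struct ChainNode *next;
    } ChainNode;

    /*
     * Walk an array of bucket heads and report average and maximum chain
     * length.  With a well-loaded table the average should sit near
     * NTUP_PER_BUCKET; the hope is that it drops as work_mem grows.
     */
    static void
    report_bucket_loading(ChainNode **buckets, int nbuckets)
    {
        long    total = 0;
        int     maxlen = 0;
        int     i;

        for (i = 0; i < nbuckets; i++)
        {
            int         len = 0;
            ChainNode  *t;

            for (t = buckets[i]; t != NULL; t = t->next)
                len++;
            total += len;
            if (len > maxlen)
                maxlen = len;
        }
        printf("buckets=%d  avg chain=%.2f  max chain=%d\n",
               nbuckets,
               nbuckets > 0 ? (double) total / nbuckets : 0.0,
               maxlen);
    }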

I would expect, for the same data, a hash-join with a work_mem of 256MB to
run faster than one run with 32MB, even if the inner relation is only 30MB.

The implementation I've been experimenting with actually takes the average
of the current calculation (ntuples/10) and the spill version
(work_mem / (tupsize * 10)).
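
As a standalone sketch (made-up names, details of the real function
elided), the heuristic looks roughly like this:

    #include <stdint.h>

    #define NTUP_PER_BUCKET 10

    /*
     * Hypothetical blend: average the estimate-driven bucket count with
     * the work_mem-driven one, so a generous work_mem still buys more
     * buckets when the row estimate is low.
     */
    static int
    choose_nbuckets_blend(double ntuples, int tupsize, int work_mem_kb)
    {
        double  est_buckets = ntuples / NTUP_PER_BUCKET;
        double  mem_buckets = (work_mem_kb * 1024.0) /
                              ((double) tupsize * NTUP_PER_BUCKET);
        double  nbuckets = (est_buckets + mem_buckets) / 2.0;

        if (nbuckets < 1.0)
            nbuckets = 1.0;
        if (nbuckets > (double) INT32_MAX)
            nbuckets = (double) INT32_MAX;
        return (int) nbuckets;
    }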

-Tim

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Timothy J. Kordas (#3)
Re: hash join hashtable size and work_mem

"Timothy J. Kordas" <tkordas@greenplum.com> writes:

> I would expect, for the same data, a hash-join with a work_mem of 256MB to
> run faster than one run with 32MB, even if the inner relation is only 30MB.

Once you get to the point where each tuple is in a different bucket, it
is clearly impossible for further increases in hashtable size to improve
matters. All you can do is waste RAM and cache lines.

Now if we set NTUP_PER_BUCKET = 1 we would not be exactly at that critical
point because of uneven bucket loading and other factors ... but I
question whether there's enough incremental improvement available to
justify making the hashtable much larger than that.
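
In round numbers: dropping NTUP_PER_BUCKET from 10 to 1 for a
1,000,000-tuple inner relation means going from roughly 100,000 buckets to
1,000,000, which can cut the average chain walked from about 10 tuples to
about 1; buckets added beyond that point are simply empty and cost memory
without shortening any chain.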

regards, tom lane

#5 Simon Riggs
simon@2ndquadrant.com
In reply to: Timothy J. Kordas (#3)
Re: hash join hashtable size and work_mem

On Wed, 2007-03-14 at 10:28 -0700, Timothy J. Kordas wrote:

> I would expect, for the same data, a hash-join with a work_mem of 256MB
> to run faster than one run with 32MB, even if the inner relation is
> only 30MB.

Certainly not for all data, but for some distributions yes, probably.

The easiest thing to do is prove that's true and then work out how to
spot that case ahead of time, or at least find a place where you can
adjust your assumptions cheaply enough to improve things.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com