new hashing function

Started by Neil Conwayabout 24 years ago4 messageshackers
Jump to latest
#1Neil Conway
neilc@samurai.com

Hi,

I've implemented Bob Jenkin's hash function for PostgreSQL; more
information on the hash function can be found at
http://burtleburtle.net/bob/hash/doobs.html

I'm posting this to -hackers because I'd like to get some feedback on
whether this actually improves performance. I've tested 2 situations
locally:

(1) pgbench, btree indexes (to test typical performance)

(2) pgbench, hash indexes

I haven't looked at the implementation of hash joins; if they happen to
use this hash function as well, that would be another informative
situation to benchmark.

In my local tests, the performance in (1) is basically the same, while
the performance in (2) is increased by 4% to 8%. If you could let me
know the results on your local setup, it should become clear whether
this patch is a performance win overall.

Note that to test case (2) properly, you'll need to drop and re-create
your pgbench indexes after you apply the patch (so that when the new
indexes are created, they use the new hash function). Also, it would be
a good idea to use concurrency level 1; at higher concurrency levels,
hash indexes have a tendancy to deadlock (yes, I'm currently trying to
fix that too ;-) ).

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

Attachments:

jenkins_hash-3.patchtext/plain; charset=ISO-8859-1Download+173-156
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#1)
Re: new hashing function

Neil Conway <nconway@klamath.dyndns.org> writes:

I haven't looked at the implementation of hash joins; if they happen to
use this hash function as well, that would be another informative
situation to benchmark.

Hash joins use some chosen-at-random hashing code of their own; see
hashFunc() in src/backend/executor/nodeHash.c. One of the things on my
to-do list has been to replace that with the datatype-specific hash
functions used for indexing/caching, since the latter seem better
engineered (even before your improvements).

BTW, I don't particularly approve of the parts of this patch that
simply remove unused arguments from various routines. You aren't
going to save very many cycles that way, and you are reducing
flexibility (eg, the changes to remove nkeys would interfere with
trying to make hash indexes support multiple columns).

regards, tom lane

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#1)
Re: new hashing function

Neil Conway <nconway@klamath.dyndns.org> writes:

I've implemented Bob Jenkin's hash function for PostgreSQL; more
information on the hash function can be found at
http://burtleburtle.net/bob/hash/doobs.html

One other thought --- presently, catcache.c is careful to use a prime
size (257) for its hash tables, so that reducing the raw hash value
mod 257 will allow all bits of the hash to contribute to determining
the hash bucket number. This is necessary because of the relatively
poor randomness of the hash functions. Perhaps with Jenkins' function
we could dispense with that, and use a fixed power-of-2 size so that the
division becomes a simple bit masking. On machines with slow integer
division, this could be a nice speedup. Wouldn't help for hash indexes
or joins though, since they don't use constant hashtable sizes.

regards, tom lane

#4Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#2)
Re: new hashing function

On Sun, 2002-03-03 at 12:31, Tom Lane wrote:

Neil Conway <nconway@klamath.dyndns.org> writes:

I haven't looked at the implementation of hash joins; if they happen to
use this hash function as well, that would be another informative
situation to benchmark.

Hash joins use some chosen-at-random hashing code of their own; see
hashFunc() in src/backend/executor/nodeHash.c. One of the things on my
to-do list has been to replace that with the datatype-specific hash
functions used for indexing/caching, since the latter seem better
engineered (even before your improvements).

Okay, I'll implement this.

BTW, I don't particularly approve of the parts of this patch that
simply remove unused arguments from various routines. You aren't
going to save very many cycles that way, and you are reducing
flexibility (eg, the changes to remove nkeys would interfere with
trying to make hash indexes support multiple columns).

Hmmm... I had viewed removing extra, unused functions to arguments as
basically code cleanup. But I see your point -- although really, the
future purpose behind the extra arguments should probably be
documented... I'll review my changes and remove the ones that seem to
have some future benefit.

Thanks for the feedback Tom.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC