Fast, stable, portable hash function producing 4-byte or 8-byte values?

Started by Erwin Brandstetterover 6 years ago11 messagesgeneral
Jump to latest
#1Erwin Brandstetter
brsaweda@gmail.com

I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.

There is md5(), the result of which can be cast to uuid. This reliably
produces practically unique, stable 16-byte values. I have usecases where
an 8-byte or even 4-byte hash would be good enough to make collisions
reasonably unlikely. (I can recheck on the full string) - and expression
indexes substantially smaller. I could truncate md5 and cast back and
forth, but that seems like a lot of wasted computation. Are there
suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)

There is an old post from 2012 by Tom Lane suggesting that hashtext() and
friends are not for users:

/messages/by-id/24463.1329854466@sss.pgh.pa.us

Postgres 11 added hashtextextended() and friends to generate bigint hashes.
In a more recent post from 3 months ago, Tom suggests to use it in
user-land - if portability is not needed:

/messages/by-id/9434.1568839177@sss.pgh.pa.us

Is pghashlib by Marko Kreen my best option?

https://github.com/markokr/pghashlib

Or the "version-independent hash functions for PostgreSQL" from Peter
Eisentraut:

https://github.com/petere/pgvihash

Neither received updates for a couple of years. Unmaintained? Or obsolete?
And neither is available on most hosted services like RDS or Heroku (which
would be required in come cases).

So what are my best options?

Regards
Erwin

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Erwin Brandstetter (#1)
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:

I am looking for stable hash functions producing 8-byte or 4-byte hashes from long text values in Postgres 10 or later.

[...]

There is an old post from 2012 by Tom Lane suggesting that hashtext() and friends are not for users:

/messages/by-id/24463.1329854466@sss.pgh.pa.us

Changing a hash function would corrupt hash indexes, wouldn't it?

So I'd expect these functions to be pretty stable:

SELECT amp.amproc
FROM pg_amproc AS amp
JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
AND amp.amprocnum = 1;

Or at least there would have to be a fat warning in the release notes
to reindex hash indexes.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#3Miles Elam
miles.elam@productops.com
In reply to: Erwin Brandstetter (#1)
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

In terms of "wasted computation", MD5, SHA1, and the others always compute
the full length before they are passed to a UUID, int, or whatever. It's a
sunk cost. It's also a minor cost considering many hash algorithms are
performed in CPU hardware now. All that's left is the truncation and cast,
which you can't avoid easily.

Sure, you could reimplement Java's .hashCode() method by iterating through
the characters and processing the character codes:

s[0]*31^(n - 1) + s[1]*31^(n - 2) + ... + s[n - 1]

I don't see how that would beat the CPU-based hashes though unless you
wrote a C-based extension. Maybe it's better just to embrace the
user-defined function first and then decide if performance is insufficient
for your use cases.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION hash8 (p_data text, p_algo text = 'md5') RETURNS
int8 AS $$

SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 16),
'hex'))::bit(64)::int8

$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;

CREATE OR REPLACE FUNCTION hash4 (p_data text, p_algo text = 'md5') RETURNS
int4 AS $$

SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 8),
'hex'))::bit(32)::int4

$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;

SELECT

hash4('something something something'),

hash4('something something something', 'sha1'),

hash8('something something something'),

hash8('something something something', 'sha1');

Cheers,

Miles

On Tue, Dec 10, 2019 at 1:12 PM Erwin Brandstetter <brsaweda@gmail.com>
wrote:

Show quoted text

I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.

There is md5(), the result of which can be cast to uuid. This reliably
produces practically unique, stable 16-byte values. I have usecases where
an 8-byte or even 4-byte hash would be good enough to make collisions
reasonably unlikely. (I can recheck on the full string) - and expression
indexes substantially smaller. I could truncate md5 and cast back and
forth, but that seems like a lot of wasted computation. Are there
suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)

There is an old post from 2012 by Tom Lane suggesting that hashtext() and
friends are not for users:

/messages/by-id/24463.1329854466@sss.pgh.pa.us

Postgres 11 added hashtextextended() and friends to generate bigint
hashes. In a more recent post from 3 months ago, Tom suggests to use it in
user-land - if portability is not needed:

/messages/by-id/9434.1568839177@sss.pgh.pa.us

Is pghashlib by Marko Kreen my best option?

https://github.com/markokr/pghashlib

Or the "version-independent hash functions for PostgreSQL" from Peter
Eisentraut:

https://github.com/petere/pgvihash

Neither received updates for a couple of years. Unmaintained? Or obsolete?
And neither is available on most hosted services like RDS or Heroku (which
would be required in come cases).

So what are my best options?

Regards
Erwin

#4Ron
ronljohnsonjr@gmail.com
In reply to: Erwin Brandstetter (#1)
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

On 12/10/19 3:11 PM, Erwin Brandstetter wrote:

I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.

There is md5(), the result of which can be cast to uuid. This reliably
produces practically unique, stable 16-byte values. I have usecases where
an 8-byte or even 4-byte hash would be good enough to make collisions
reasonably unlikely. (I can recheck on the full string) - and expression
indexes substantially smaller. I could truncate md5 and cast back and
forth, but that seems like a lot of wasted computation. Are there
suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)

What about a CRC32 function?  It's fast, and an SSE4 instruction has been in
Intel CPUs for about 10 years.

There is an old post from 2012 by Tom Lane suggesting that hashtext() and
friends are not for users:

/messages/by-id/24463.1329854466@sss.pgh.pa.us

Postgres 11 added hashtextextended() and friends to generate bigint
hashes. In a more recent post from 3 months ago, Tom suggests to use it in
user-land - if portability is not needed:

/messages/by-id/9434.1568839177@sss.pgh.pa.us

Is pghashlib by Marko Kreen my best option?

https://github.com/markokr/pghashlib

Or the "version-independent hash functions for PostgreSQL" from Peter
Eisentraut:

https://github.com/petere/pgvihash

Neither received updates for a couple of years. Unmaintained? Or obsolete?
And neither is available on most hosted services like RDS or Heroku (which
would be required in come cases).

So what are my best options?

Regards
Erwin

--
Angular momentum makes the world go 'round.

#5Erwin Brandstetter
brsaweda@gmail.com
In reply to: Laurenz Albe (#2)
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

Thanks for the suggestion. Seems like a good assumption and I have been
using hashtext() in the past. But I am uncertain whether it is the best
option.

Guess Tom's warning in
/messages/by-id/9434.1568839177@sss.pgh.pa.us about
portability only refers to hashtextextended() and friends not being there
in Postgres 10 or older.

But why are none of these functions documented? Does the project still not
...

want people to rely on them continuing to have exactly the current

behavior.

I am not complaining, maybe just nobody did the work. But it's also
mentioned in this old thread, that hastext() changed in the past. Is all of
that outdated and we are welcome to use those functions for indexing?
/messages/by-id/24463.1329854466@sss.pgh.pa.us
</messages/by-id/24463.1329854466@sss.pgh.pa.us&gt;

Filtering with amprocnum = 2 gets functions producing bigint in Postgres 11
or later. Not sure about the exact meaning of amprocnum, manual says
"Support function number".

Remaining problem either way: no hash function returning bigint for
Postgres 10.

Regards
Erwin

On Tue, Dec 10, 2019 at 11:13 PM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:

Show quoted text

On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:

I am looking for stable hash functions producing 8-byte or 4-byte hashes

from long text values in Postgres 10 or later.

[...]

There is an old post from 2012 by Tom Lane suggesting that hashtext()

and friends are not for users:

/messages/by-id/24463.1329854466@sss.pgh.pa.us

Changing a hash function would corrupt hash indexes, wouldn't it?

So I'd expect these functions to be pretty stable:

SELECT amp.amproc
FROM pg_amproc AS amp
JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
AND amp.amprocnum = 1;

Or at least there would have to be a fat warning in the release notes
to reindex hash indexes.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#6Erwin Brandstetter
brsaweda@gmail.com
In reply to: Miles Elam (#3)
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

On Tue, Dec 10, 2019 at 11:34 PM Miles Elam <miles.elam@productops.com>
wrote:

In terms of "wasted computation", MD5, SHA1, and the others always compute
the full length before they are passed to a UUID, int, or whatever. It's a
sunk cost. It's also a minor cost considering many hash algorithms are
performed in CPU hardware now. All that's left is the truncation and cast,
which you can't avoid easily.

Sure, you could reimplement Java's .hashCode() method by iterating through
the characters and processing the character codes:

s[0]*31^(n - 1) + s[1]*31^(n - 2) + ... + s[n - 1]

I don't see how that would beat the CPU-based hashes though unless you
wrote a C-based extension. Maybe it's better just to embrace the
user-defined function first and then decide if performance is insufficient
for your use cases.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION hash8 (p_data text, p_algo text = 'md5')
RETURNS int8 AS $$

SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 16),
'hex'))::bit(64)::int8

$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;

CREATE OR REPLACE FUNCTION hash4 (p_data text, p_algo text = 'md5')
RETURNS int4 AS $$

SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 8),
'hex'))::bit(32)::int4

$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;

SELECT

hash4('something something something'),

hash4('something something something', 'sha1'),

hash8('something something something'),

hash8('something something something', 'sha1');

Cheers,

Miles

Thanks for the custom functions! May be useful as fallback.

But I am really looking for standard functions in Postgres first. Those
should be faster and more reliable than writing my own.

Regards

Erwin

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Erwin Brandstetter (#5)
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

Erwin Brandstetter <brsaweda@gmail.com> writes:

Guess Tom's warning in
/messages/by-id/9434.1568839177@sss.pgh.pa.us about
portability only refers to hashtextextended() and friends not being there
in Postgres 10 or older.

Well, the other portability issue that is worth considering is that
these functions are only intended to give stable results within a
particular installation; their use for hash indexes does not require
the same results across different platforms. Notably, most of them
give different answers on little-endian and big-endian machines.

That's not necessarily a reason not to use them, but you have to
be careful what you assume about them.

regards, tom lane

#8George Neuner
gneuner2@comcast.net
In reply to: Erwin Brandstetter (#1)
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@gmail.com>
wrote:

On 12/10/19 3:11 PM, Erwin Brandstetter wrote:

I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.

There is md5(), the result of which can be cast to uuid. This reliably
produces practically unique, stable 16-byte values. I have usecases where
an 8-byte or even 4-byte hash would be good enough to make collisions
reasonably unlikely. (I can recheck on the full string) - and expression
indexes substantially smaller. I could truncate md5 and cast back and
forth, but that seems like a lot of wasted computation. Are there
suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)

What about a CRC32 function?  It's fast, and an SSE4 instruction has been in
Intel CPUs for about 10 years.

On long text CRC will not be as discriminating as a real cryptohash,
but it may be considerably faster to compute depending on the length
of the text. Given that the OP can check the actual string in the
event of a collision, it may work well enough.

One way to make it more discriminating is to compute a pair of CRCs
using different polynomials. Unfortunately the SSE4.2 CRC32
instruction uses a fixed polynomial. You can compute it twice using
different initial values, but the result won't be as good as actually
using different polynomials.

I used to create 4-byte hashes by concatenating two 16-bit CRCs - one
using the X-modem polynomial and the other using the CCITT polynomial.
Since these gave very different results, it worked quite well for my
use case where all the strings were guaranteed < 16KB.

YMMV,
George

#9Erik Aronesty
erik@q32.com
In reply to: Erwin Brandstetter (#5)
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

You can always tweak fnv for whatever bite-size or bit size you want.
Sometimes I know a little information about my data shape and make a
custom fnv that only looks at the first half for the last half of a string,
etc.

On Wed, Dec 11, 2019, 1:02 PM Erwin Brandstetter <brsaweda@gmail.com> wrote:

Show quoted text

Thanks for the suggestion. Seems like a good assumption and I have been
using hashtext() in the past. But I am uncertain whether it is the best
option.

Guess Tom's warning in
/messages/by-id/9434.1568839177@sss.pgh.pa.us about
portability only refers to hashtextextended() and friends not being there
in Postgres 10 or older.

But why are none of these functions documented? Does the project still not
...

want people to rely on them continuing to have exactly the current

behavior.

I am not complaining, maybe just nobody did the work. But it's also
mentioned in this old thread, that hastext() changed in the past. Is all of
that outdated and we are welcome to use those functions for indexing?

/messages/by-id/24463.1329854466@sss.pgh.pa.us
</messages/by-id/24463.1329854466@sss.pgh.pa.us&gt;

Filtering with amprocnum = 2 gets functions producing bigint in Postgres
11 or later. Not sure about the exact meaning of amprocnum, manual says
"Support function number".

Remaining problem either way: no hash function returning bigint for
Postgres 10.

Regards
Erwin

On Tue, Dec 10, 2019 at 11:13 PM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:

On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:

I am looking for stable hash functions producing 8-byte or 4-byte

hashes from long text values in Postgres 10 or later.

[...]

There is an old post from 2012 by Tom Lane suggesting that hashtext()

and friends are not for users:

/messages/by-id/24463.1329854466@sss.pgh.pa.us

Changing a hash function would corrupt hash indexes, wouldn't it?

So I'd expect these functions to be pretty stable:

SELECT amp.amproc
FROM pg_amproc AS amp
JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
AND amp.amprocnum = 1;

Or at least there would have to be a fat warning in the release notes
to reindex hash indexes.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#10Ron
ronljohnsonjr@gmail.com
In reply to: George Neuner (#8)
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

On 12/15/19 3:59 PM, George Neuner wrote:

On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@gmail.com>
wrote:

On 12/10/19 3:11 PM, Erwin Brandstetter wrote:

I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.

There is md5(), the result of which can be cast to uuid. This reliably
produces practically unique, stable 16-byte values. I have usecases where
an 8-byte or even 4-byte hash would be good enough to make collisions
reasonably unlikely. (I can recheck on the full string) - and expression
indexes substantially smaller. I could truncate md5 and cast back and
forth, but that seems like a lot of wasted computation. Are there
suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)

What about a CRC32 function?  It's fast, and an SSE4 instruction has been in
Intel CPUs for about 10 years.

On long text CRC will not be as discriminating as a real cryptohash,

When specifying a 4 byte hash, something must be sacrificed...

--
Angular momentum makes the world go 'round.

#11George Neuner
gneuner2@comcast.net
In reply to: Erwin Brandstetter (#1)
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

On Sun, 15 Dec 2019 20:23:25 -0600, Ron <ronljohnsonjr@gmail.com>
wrote:

On 12/15/19 3:59 PM, George Neuner wrote:

On long text CRC will not be as discriminating as a real cryptohash,

When specifying a 4 byte hash, something must be sacrificed...

Obviously. But the main point is that CRC never was designed to
uniquely fingerprint data - it was designed to detect corruption of
the data, which is a much weaker guarantee than the cryptodigest
hashes.

Despite being the same length as an MD5 hash, a 128-bit CRC still
might not be as discriminating ... it depends greatly on the CRC
polynomial used and on the length of the input.

George