Fast, stable, portable hash function producing 4-byte or 8-byte values?
I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.
There is md5(), the result of which can be cast to uuid. This reliably
produces practically unique, stable 16-byte values. I have usecases where
an 8-byte or even 4-byte hash would be good enough to make collisions
reasonably unlikely. (I can recheck on the full string) - and expression
indexes substantially smaller. I could truncate md5 and cast back and
forth, but that seems like a lot of wasted computation. Are there
suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)
There is an old post from 2012 by Tom Lane suggesting that hashtext() and
friends are not for users:
/messages/by-id/24463.1329854466@sss.pgh.pa.us
Postgres 11 added hashtextextended() and friends to generate bigint hashes.
In a more recent post from 3 months ago, Tom suggests to use it in
user-land - if portability is not needed:
/messages/by-id/9434.1568839177@sss.pgh.pa.us
Is pghashlib by Marko Kreen my best option?
https://github.com/markokr/pghashlib
Or the "version-independent hash functions for PostgreSQL" from Peter
Eisentraut:
https://github.com/petere/pgvihash
Neither received updates for a couple of years. Unmaintained? Or obsolete?
And neither is available on most hosted services like RDS or Heroku (which
would be required in come cases).
So what are my best options?
Regards
Erwin
On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:
I am looking for stable hash functions producing 8-byte or 4-byte hashes from long text values in Postgres 10 or later.
[...]
There is an old post from 2012 by Tom Lane suggesting that hashtext() and friends are not for users:
Changing a hash function would corrupt hash indexes, wouldn't it?
So I'd expect these functions to be pretty stable:
SELECT amp.amproc
FROM pg_amproc AS amp
JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
AND amp.amprocnum = 1;
Or at least there would have to be a fat warning in the release notes
to reindex hash indexes.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
In terms of "wasted computation", MD5, SHA1, and the others always compute
the full length before they are passed to a UUID, int, or whatever. It's a
sunk cost. It's also a minor cost considering many hash algorithms are
performed in CPU hardware now. All that's left is the truncation and cast,
which you can't avoid easily.
Sure, you could reimplement Java's .hashCode() method by iterating through
the characters and processing the character codes:
s[0]*31^(n - 1) + s[1]*31^(n - 2) + ... + s[n - 1]
I don't see how that would beat the CPU-based hashes though unless you
wrote a C-based extension. Maybe it's better just to embrace the
user-defined function first and then decide if performance is insufficient
for your use cases.
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE OR REPLACE FUNCTION hash8 (p_data text, p_algo text = 'md5') RETURNS
int8 AS $$
SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 16),
'hex'))::bit(64)::int8
$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;
CREATE OR REPLACE FUNCTION hash4 (p_data text, p_algo text = 'md5') RETURNS
int4 AS $$
SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 8),
'hex'))::bit(32)::int4
$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;
SELECT
hash4('something something something'),
hash4('something something something', 'sha1'),
hash8('something something something'),
hash8('something something something', 'sha1');
Cheers,
Miles
On Tue, Dec 10, 2019 at 1:12 PM Erwin Brandstetter <brsaweda@gmail.com>
wrote:
Show quoted text
I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.There is md5(), the result of which can be cast to uuid. This reliably
produces practically unique, stable 16-byte values. I have usecases where
an 8-byte or even 4-byte hash would be good enough to make collisions
reasonably unlikely. (I can recheck on the full string) - and expression
indexes substantially smaller. I could truncate md5 and cast back and
forth, but that seems like a lot of wasted computation. Are there
suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)There is an old post from 2012 by Tom Lane suggesting that hashtext() and
friends are not for users:/messages/by-id/24463.1329854466@sss.pgh.pa.us
Postgres 11 added hashtextextended() and friends to generate bigint
hashes. In a more recent post from 3 months ago, Tom suggests to use it in
user-land - if portability is not needed:/messages/by-id/9434.1568839177@sss.pgh.pa.us
Is pghashlib by Marko Kreen my best option?
https://github.com/markokr/pghashlib
Or the "version-independent hash functions for PostgreSQL" from Peter
Eisentraut:https://github.com/petere/pgvihash
Neither received updates for a couple of years. Unmaintained? Or obsolete?
And neither is available on most hosted services like RDS or Heroku (which
would be required in come cases).So what are my best options?
Regards
Erwin
On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.There is md5(), the result of which can be cast to uuid. This reliably
produces practically unique, stable 16-byte values. I have usecases where
an 8-byte or even 4-byte hash would be good enough to make collisions
reasonably unlikely. (I can recheck on the full string) - and expression
indexes substantially smaller. I could truncate md5 and cast back and
forth, but that seems like a lot of wasted computation. Are there
suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)
What about a CRC32 function? It's fast, and an SSE4 instruction has been in
Intel CPUs for about 10 years.
There is an old post from 2012 by Tom Lane suggesting that hashtext() and
friends are not for users:/messages/by-id/24463.1329854466@sss.pgh.pa.us
Postgres 11 added hashtextextended() and friends to generate bigint
hashes. In a more recent post from 3 months ago, Tom suggests to use it in
user-land - if portability is not needed:/messages/by-id/9434.1568839177@sss.pgh.pa.us
Is pghashlib by Marko Kreen my best option?
https://github.com/markokr/pghashlib
Or the "version-independent hash functions for PostgreSQL" from Peter
Eisentraut:https://github.com/petere/pgvihash
Neither received updates for a couple of years. Unmaintained? Or obsolete?
And neither is available on most hosted services like RDS or Heroku (which
would be required in come cases).So what are my best options?
Regards
Erwin
--
Angular momentum makes the world go 'round.
Thanks for the suggestion. Seems like a good assumption and I have been
using hashtext() in the past. But I am uncertain whether it is the best
option.
Guess Tom's warning in
/messages/by-id/9434.1568839177@sss.pgh.pa.us about
portability only refers to hashtextextended() and friends not being there
in Postgres 10 or older.
But why are none of these functions documented? Does the project still not
...
want people to rely on them continuing to have exactly the current
behavior.
I am not complaining, maybe just nobody did the work. But it's also
mentioned in this old thread, that hastext() changed in the past. Is all of
that outdated and we are welcome to use those functions for indexing?
/messages/by-id/24463.1329854466@sss.pgh.pa.us
</messages/by-id/24463.1329854466@sss.pgh.pa.us>
Filtering with amprocnum = 2 gets functions producing bigint in Postgres 11
or later. Not sure about the exact meaning of amprocnum, manual says
"Support function number".
Remaining problem either way: no hash function returning bigint for
Postgres 10.
Regards
Erwin
On Tue, Dec 10, 2019 at 11:13 PM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:
Show quoted text
On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:
I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.
[...]
There is an old post from 2012 by Tom Lane suggesting that hashtext()
and friends are not for users:
Changing a hash function would corrupt hash indexes, wouldn't it?
So I'd expect these functions to be pretty stable:
SELECT amp.amproc
FROM pg_amproc AS amp
JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
AND amp.amprocnum = 1;Or at least there would have to be a fat warning in the release notes
to reindex hash indexes.Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
On Tue, Dec 10, 2019 at 11:34 PM Miles Elam <miles.elam@productops.com>
wrote:
In terms of "wasted computation", MD5, SHA1, and the others always compute
the full length before they are passed to a UUID, int, or whatever. It's a
sunk cost. It's also a minor cost considering many hash algorithms are
performed in CPU hardware now. All that's left is the truncation and cast,
which you can't avoid easily.Sure, you could reimplement Java's .hashCode() method by iterating through
the characters and processing the character codes:s[0]*31^(n - 1) + s[1]*31^(n - 2) + ... + s[n - 1]
I don't see how that would beat the CPU-based hashes though unless you
wrote a C-based extension. Maybe it's better just to embrace the
user-defined function first and then decide if performance is insufficient
for your use cases.CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE OR REPLACE FUNCTION hash8 (p_data text, p_algo text = 'md5')
RETURNS int8 AS $$SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 16),
'hex'))::bit(64)::int8$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;
CREATE OR REPLACE FUNCTION hash4 (p_data text, p_algo text = 'md5')
RETURNS int4 AS $$SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 8),
'hex'))::bit(32)::int4$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;
SELECT
hash4('something something something'),
hash4('something something something', 'sha1'),
hash8('something something something'),
hash8('something something something', 'sha1');
Cheers,
Miles
Thanks for the custom functions! May be useful as fallback.
But I am really looking for standard functions in Postgres first. Those
should be faster and more reliable than writing my own.
Regards
Erwin
Erwin Brandstetter <brsaweda@gmail.com> writes:
Guess Tom's warning in
/messages/by-id/9434.1568839177@sss.pgh.pa.us about
portability only refers to hashtextextended() and friends not being there
in Postgres 10 or older.
Well, the other portability issue that is worth considering is that
these functions are only intended to give stable results within a
particular installation; their use for hash indexes does not require
the same results across different platforms. Notably, most of them
give different answers on little-endian and big-endian machines.
That's not necessarily a reason not to use them, but you have to
be careful what you assume about them.
regards, tom lane
On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@gmail.com>
wrote:
On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.There is md5(), the result of which can be cast to uuid. This reliably
produces practically unique, stable 16-byte values. I have usecases where
an 8-byte or even 4-byte hash would be good enough to make collisions
reasonably unlikely. (I can recheck on the full string) - and expression
indexes substantially smaller. I could truncate md5 and cast back and
forth, but that seems like a lot of wasted computation. Are there
suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)What about a CRC32 function? It's fast, and an SSE4 instruction has been in
Intel CPUs for about 10 years.
On long text CRC will not be as discriminating as a real cryptohash,
but it may be considerably faster to compute depending on the length
of the text. Given that the OP can check the actual string in the
event of a collision, it may work well enough.
One way to make it more discriminating is to compute a pair of CRCs
using different polynomials. Unfortunately the SSE4.2 CRC32
instruction uses a fixed polynomial. You can compute it twice using
different initial values, but the result won't be as good as actually
using different polynomials.
I used to create 4-byte hashes by concatenating two 16-bit CRCs - one
using the X-modem polynomial and the other using the CCITT polynomial.
Since these gave very different results, it worked quite well for my
use case where all the strings were guaranteed < 16KB.
YMMV,
George
You can always tweak fnv for whatever bite-size or bit size you want.
Sometimes I know a little information about my data shape and make a
custom fnv that only looks at the first half for the last half of a string,
etc.
On Wed, Dec 11, 2019, 1:02 PM Erwin Brandstetter <brsaweda@gmail.com> wrote:
Show quoted text
Thanks for the suggestion. Seems like a good assumption and I have been
using hashtext() in the past. But I am uncertain whether it is the best
option.Guess Tom's warning in
/messages/by-id/9434.1568839177@sss.pgh.pa.us about
portability only refers to hashtextextended() and friends not being there
in Postgres 10 or older.But why are none of these functions documented? Does the project still not
...want people to rely on them continuing to have exactly the current
behavior.
I am not complaining, maybe just nobody did the work. But it's also
mentioned in this old thread, that hastext() changed in the past. Is all of
that outdated and we are welcome to use those functions for indexing?/messages/by-id/24463.1329854466@sss.pgh.pa.us
</messages/by-id/24463.1329854466@sss.pgh.pa.us>Filtering with amprocnum = 2 gets functions producing bigint in Postgres
11 or later. Not sure about the exact meaning of amprocnum, manual says
"Support function number".Remaining problem either way: no hash function returning bigint for
Postgres 10.Regards
ErwinOn Tue, Dec 10, 2019 at 11:13 PM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:
I am looking for stable hash functions producing 8-byte or 4-byte
hashes from long text values in Postgres 10 or later.
[...]
There is an old post from 2012 by Tom Lane suggesting that hashtext()
and friends are not for users:
Changing a hash function would corrupt hash indexes, wouldn't it?
So I'd expect these functions to be pretty stable:
SELECT amp.amproc
FROM pg_amproc AS amp
JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
AND amp.amprocnum = 1;Or at least there would have to be a fat warning in the release notes
to reindex hash indexes.Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
On 12/15/19 3:59 PM, George Neuner wrote:
On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@gmail.com>
wrote:On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
I am looking for stable hash functions producing 8-byte or 4-byte hashes
from long text values in Postgres 10 or later.There is md5(), the result of which can be cast to uuid. This reliably
produces practically unique, stable 16-byte values. I have usecases where
an 8-byte or even 4-byte hash would be good enough to make collisions
reasonably unlikely. (I can recheck on the full string) - and expression
indexes substantially smaller. I could truncate md5 and cast back and
forth, but that seems like a lot of wasted computation. Are there
suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)What about a CRC32 function? It's fast, and an SSE4 instruction has been in
Intel CPUs for about 10 years.On long text CRC will not be as discriminating as a real cryptohash,
When specifying a 4 byte hash, something must be sacrificed...
--
Angular momentum makes the world go 'round.
On Sun, 15 Dec 2019 20:23:25 -0600, Ron <ronljohnsonjr@gmail.com>
wrote:
On 12/15/19 3:59 PM, George Neuner wrote:
On long text CRC will not be as discriminating as a real cryptohash,
When specifying a 4 byte hash, something must be sacrificed...
Obviously. But the main point is that CRC never was designed to
uniquely fingerprint data - it was designed to detect corruption of
the data, which is a much weaker guarantee than the cryptodigest
hashes.
Despite being the same length as an MD5 hash, a 128-bit CRC still
might not be as discriminating ... it depends greatly on the CRC
polynomial used and on the length of the input.
George