Hash Function: MD5 or other?

Started by Peter Fein · almost 21 years ago · 11 messages · general
#1Peter Fein
pfein@pobox.com

Hi-

I wanted to use a partially unique index (dependent on a flag) on a TEXT
column, but the index row size was too big for btrees. See the thread
"index row size 2728 exceeds btree maximum, 2713" from the beginning of
this month for someone with a similar problem. In it, someone suggests
indexing on a hash of the text. I'm fine with this, as the texts in
question are similar enough to each other to make collisions unlikely
and a collision won't really cause any serious problems.

My question is: is the builtin MD5 appropriate for this use or should I
be using a function from pl/something? Figures on collision rates would
be nice as well - the typical chunk of text is probably 1k-8k.

Thanks!

--
Peter Fein pfein@pobox.com 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman

#2Shelby Cain
alyandon@yahoo.com
In reply to: Peter Fein (#1)
Re: Hash Function: MD5 or other?
--- Peter Fein <pfein@pobox.com> wrote:

Hi-

I wanted to use a partially unique index (dependent on a flag) on a TEXT
column, but the index row size was too big for btrees. See the thread
"index row size 2728 exceeds btree maximum, 2713" from the beginning of
this month for someone with a similar problem. In it, someone suggests
indexing on a hash of the text. I'm fine with this, as the texts in
question are similar enough to each other to make collisions unlikely
and a collision won't really cause any serious problems.

My question is: is the builtin MD5 appropriate for this use or should I
be using a function from pl/something? Figures on collision rates would
be nice as well - the typical chunk of text is probably 1k-8k.

Thanks!

--

I believe the odds of two arbitrary inputs yielding the same MD5 hash
would be 1 in 2^128. Even though the odds of collision are small,
you'll want to write your query such that you use the index on the hash
and then filter on the text field to guarantee you get the result you
are interested in.
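The 1-in-2^128 figure is for one specific pair of inputs; for a whole table, the birthday bound is the number that matters. A quick back-of-envelope check (a standard approximation, nothing PostgreSQL-specific; the row counts below are made up for illustration):

```python
import math

def collision_probability(n_rows: int, hash_bits: int) -> float:
    """Birthday-bound approximation: probability that at least two of
    n_rows distinct inputs share a hash, p ~= 1 - exp(-n(n-1)/2^(b+1)).
    expm1 is used so tiny probabilities don't round to zero."""
    space = 2.0 ** hash_bits
    return -math.expm1(-n_rows * (n_rows - 1) / (2.0 * space))

# A billion rows is nowhere near MD5's ~2^64-row birthday bound ...
p_md5 = collision_probability(10**9, 128)
# ... while a 32-bit CRC space fills up after only ~100k rows.
p_crc32 = collision_probability(100_000, 32)
```

With these inputs, p_md5 is on the order of 10^-21, while p_crc32 is already better than even odds, which is why the later suggestions to fall back to CRC only make sense if collisions are cheap to handle.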

Regards,

Shelby Cain


#3Greg Stark
gsstark@mit.edu
In reply to: Shelby Cain (#2)
Re: Hash Function: MD5 or other?

Shelby Cain <alyandon@yahoo.com> writes:

My question is: is the builtin MD5 appropriate for this use or should I be
using a function from pl/something? Figures on collision rates would be
nice as well - the typical chunk of text is probably 1k-8k.

Note that MD5 is slow and CPU-intensive. By design.

If you want a quick way to find matching records then you might find something
like CRC to be more useful. With MD5 it's supposed to be hard for someone to
come up with inputs that hash to a target value, but if you're not too worried
about people trying to do that then MD5 is probably overkill.

--
greg

#4Tino Wildenhain
tino@wildenhain.de
In reply to: Greg Stark (#3)
Re: Hash Function: MD5 or other?

On Tuesday, 2005-06-14, at 02:40 -0400, Greg Stark wrote:

Shelby Cain <alyandon@yahoo.com> writes:

My question is: is the builtin MD5 appropriate for this use or should I be
using a function from pl/something? Figures on collision rates would be
nice as well - the typical chunk of text is probably 1k-8k.

Note that MD5 is slow and CPU-intensive. By design.

If you want a quick way to find matching records then you might find something
like CRC to be more useful. With MD5 it's supposed to be hard for someone to
come up with inputs that hash to a target value, but if you're not too worried
about people trying to do that then MD5 is probably overkill.

SHA-1 is said to be faster.
(And less prone to collisions.)
--
Tino Wildenhain <tino@wildenhain.de>

#5Alex Stapleton
alexs@advfn.com
In reply to: Peter Fein (#1)
Re: Hash Function: MD5 or other?

On 13 Jun 2005, at 23:49, Peter Fein wrote:

Hi-

I wanted to use a partially unique index (dependent on a flag) on a TEXT
column, but the index row size was too big for btrees. See the thread
"index row size 2728 exceeds btree maximum, 2713" from the beginning of
this month for someone with a similar problem. In it, someone suggests
indexing on a hash of the text. I'm fine with this, as the texts in
question are similar enough to each other to make collisions unlikely
and a collision won't really cause any serious problems.

My question is: is the builtin MD5 appropriate for this use or should I
be using a function from pl/something? Figures on collision rates would
be nice as well - the typical chunk of text is probably 1k-8k.

Thanks!

As others have said, MD5 isn't the fastest one out there, but no
cryptographically secure hash is really that fast. You can probably get
away with using a CRC that is long enough to reduce your chances of
collision a lot. However, PostgreSQL doesn't have a built-in CRC
function, which is a bit of a pain unless you're prepared to implement
one, or use pl/* to do it, which sounds like overkill. I suggest you run
some benchmarks on MD5 and see if it's fast enough to meet your current
(and perhaps future) needs.
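That benchmark is easy to sketch client-side with nothing but the standard library; a rough, machine-dependent throughput check (assuming the hash would be computed in the application rather than in the database):

```python
import hashlib
import time
import zlib

payload = b"x" * 8192  # upper end of the 1k-8k texts under discussion

def throughput_mb_s(hash_fn, reps: int = 2000) -> float:
    """Rough MB/s achieved hashing `payload` `reps` times."""
    start = time.perf_counter()
    for _ in range(reps):
        hash_fn(payload)
    elapsed = time.perf_counter() - start
    return reps * len(payload) / elapsed / 1e6

md5_mb_s = throughput_mb_s(lambda b: hashlib.md5(b).digest())
crc_mb_s = throughput_mb_s(zlib.crc32)
```

The absolute numbers depend entirely on the machine; the point is the relative gap between a 128-bit cryptographic digest and a 32-bit checksum.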

You could of course, just use a hash index on your text field! I
think that would probably cope with larger data sets OK. It has the
advantage of handling collisions for you as well :) However it means
you have to transfer large amounts of data around, so if network
speed ever becomes a limitation, MD5 hashing (or enabling compression
on your PgSQL connection) may help.


--
Peter Fein pfein@pobox.com
773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

#6Peter Fein
pfein@pobox.com
In reply to: Alex Stapleton (#5)
Re: Hash Function: MD5 or other?

Alex Stapleton wrote:

On 13 Jun 2005, at 23:49, Peter Fein wrote:

Hi-

I wanted to use a partially unique index (dependent on a flag) on a TEXT
column, but the index row size was too big for btrees. See the thread
"index row size 2728 exceeds btree maximum, 2713" from the beginning of
this month for someone with a similar problem. In it, someone suggests
indexing on a hash of the text. I'm fine with this, as the texts in
question are similar enough to each other to make collisions unlikely
and a collision won't really cause any serious problems.

My question is: is the builtin MD5 appropriate for this use or should I
be using a function from pl/something? Figures on collision rates would
be nice as well - the typical chunk of text is probably 1k-8k.

Thanks!

As others have said MD5 isn't the fastest one out there. However no
cryptographically secure hashes are really that fast. However you can
probably get away with using a CRC hash which is long enough to reduce
your chances of collision a lot. However, PostgreSQL doesn't have a
built-in CRC function, which is a bit of a pain unless you're prepared to
implement one, or use pl/* to do it, which sounds like overkill. I
suggest you run some benchmarks on MD5 and see if it's fast enough to
meet your current (and perhaps future) needs.

You could of course, just use a hash index on your text field! I think
that would probably cope with larger data sets OK. It has the advantage
of handling collisions for you as well :) However it means you have to
transfer large amounts of data around, so if network speed ever becomes
a limitation, MD5 hashing (or enabling compression on your PgSQL
connection) may help.

Let me clarify my problem:

I've got a bunch of texts, stored in column sometext. These texts are
assigned to groups, group_id. Each group may contain rows with identical
sometext. Finding the appropriate group when inserting is very expensive
and is calculated on the client side. I want to mark *one* of each of
the exact duplicates in a group with a flag, group_representative.
Basically, a partial index on:

(group_id, sometext) WHERE group_representative=TRUE.

The motivation is then to be able to do:

SELECT * FROM mytable WHERE <some other conditions> AND
group_representative=TRUE

instead of:

SELECT DISTINCT ON (group_id, sometext) * FROM mytable
WHERE <some other conditions>

since there are more of these queries than inserts (and they return a
lot more rows than belong to a single group_id). I suspect that running
a DISTINCT in this case would be more expensive than updating the index
when inserting.

Knowing the specifics of the data I'm putting in sometext, a halfway
decent hash function would make collisions so rare as to make the chance
insignificant (and collisions wouldn't break anything anyway). Is this
approach reasonable, or should I use a hash index on (group_id,
sometext) - does this suffer from the same size limitation as btrees? I
thought hash indexes were slow...

Hope that's clearer...

--Pete

--
Peter Fein pfein@pobox.com 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman

#7Alex Stapleton
alexs@advfn.com
In reply to: Peter Fein (#6)
Re: Hash Function: MD5 or other?

On 14 Jun 2005, at 14:33, Peter Fein wrote:

Alex Stapleton wrote:

On 13 Jun 2005, at 23:49, Peter Fein wrote:

Hi-

I wanted to use a partially unique index (dependent on a flag) on a TEXT
column, but the index row size was too big for btrees. See the thread
"index row size 2728 exceeds btree maximum, 2713" from the beginning of
this month for someone with a similar problem. In it, someone suggests
indexing on a hash of the text. I'm fine with this, as the texts in
question are similar enough to each other to make collisions unlikely
and a collision won't really cause any serious problems.

My question is: is the builtin MD5 appropriate for this use or should I
be using a function from pl/something? Figures on collision rates would
be nice as well - the typical chunk of text is probably 1k-8k.

Thanks!

As others have said, MD5 isn't the fastest one out there, but no
cryptographically secure hash is really that fast. You can probably get
away with using a CRC that is long enough to reduce your chances of
collision a lot. However, PostgreSQL doesn't have a built-in CRC
function, which is a bit of a pain unless you're prepared to implement
one, or use pl/* to do it, which sounds like overkill. I suggest you run
some benchmarks on MD5 and see if it's fast enough to meet your current
(and perhaps future) needs.

You could of course, just use a hash index on your text field! I think
that would probably cope with larger data sets OK. It has the advantage
of handling collisions for you as well :) However it means you have to
transfer large amounts of data around, so if network speed ever becomes
a limitation, MD5 hashing (or enabling compression on your PgSQL
connection) may help.

Let me clarify my problem:

I've got a bunch of texts, stored in column sometext. These texts are
assigned to groups, group_id. Each group may contain rows with identical
sometext. Finding the appropriate group when inserting is very expensive
and is calculated on the client side. I want to mark *one* of each of
the exact duplicates in a group with a flag, group_representative.
Basically, a partial index on:

(group_id, sometext) WHERE group_representative=TRUE.

The motivation is then to be able to do:

SELECT * FROM mytable WHERE <some other conditions> AND
group_representative=TRUE

instead of:

SELECT DISTINCT ON (group_id, sometext) * FROM mytable
WHERE <some other conditions>

since there are more of these queries than inserts (and they return a
lot more rows than belong to a single group_id). I suspect that running
a DISTINCT in this case would be more expensive than updating the index
when inserting.

Knowing the specifics of the data I'm putting in sometext, a halfway
decent hash function would make collisions so rare as to make the chance
insignificant (and collisions wouldn't break anything anyway). Is this
approach reasonable, or should I use a hash index on (group_id,
sometext) - does this suffer from the same size limitation as btrees? I
thought hash indexes were slow...

Hope that's clearer...

From the manual.

Testing has shown PostgreSQL's hash indexes to perform no better than
B-tree indexes, and the index size and build time for hash indexes is
much worse. For these reasons, hash index use is presently discouraged.

So yes, they are slower, but not by an enormous amount. (The lack of
any data to back up the manual's conclusions worries me, but never
mind.) It is certainly worth trying.

The only hashing function PG has built in which you can actually
access is MD5. You might be able to get better performance from that
if data transmission takes a long time as you will only need to send
the hash to the database. You could then build a B-Tree index on the
MD5 hashes, which would certainly do what you want it to as the
likelihood of a collision is low. So basically it all boils down to
just how slow MD5 is, and just how fast your network connection to
your DB server(s) is.


--Pete

--
Peter Fein pfein@pobox.com
773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman


#8Shelby Cain
alyandon@yahoo.com
In reply to: Greg Stark (#3)
Re: Hash Function: MD5 or other?
--- Greg Stark <gsstark@mit.edu> wrote:

Note that MD5 is slow and CPU-intensive. By design.

If you want a quick way to find matching records then you might find
something like CRC to be more useful. With MD5 it's supposed to be hard
for someone to come up with inputs that hash to a target value, but if
you're not too worried about people trying to do that then MD5 is
probably overkill.

$ ./hash -b

CRC32: 302.78 MB/sec
HAVAL 128: 165.33 MB/sec
HAVAL 160: 178.69 MB/sec
HAVAL 192: 124.74 MB/sec
HAVAL 224: 123.05 MB/sec
HAVAL 256: 98.14 MB/sec
MD2: 9.03 MB/sec
MD4: 233.36 MB/sec
MD5: 105.39 MB/sec
Panama: 311.21 MB/sec
RIPEMD-128: 129.88 MB/sec
RIPEMD-160: 76.75 MB/sec
SHA1: 135.40 MB/sec
SHA256: 49.42 MB/sec
SHA384: 32.77 MB/sec
SHA512: 31.58 MB/sec
Tiger: 54.02 MB/sec
Whirlpool: 17.51 MB/sec

Elapsed time: 3.56 seconds
Average throughput: 121.06 MB/s

Granted, MD5 isn't the quickest hashing algorithm out there but it is
certainly fast enough for general use IMO.

Regards,

Shelby Cain


#9Bruno Wolff III
bruno@wolff.to
In reply to: Peter Fein (#6)
Re: Hash Function: MD5 or other?

On Tue, Jun 14, 2005 at 08:33:34 -0500,
Peter Fein <pfein@pobox.com> wrote:

Knowing the specifics of the data I'm putting in sometext, a halfway
decent hash function would make collisions so rare as to make the chance
insignificant (and collisions wouldn't break anything anyway). Is this
approach reasonable, or should I use a hash index on (group_id,
sometext) - does this suffer from the same size limitation as btrees? I
thought hash indexes were slow...

The hash value should be saved as a separate column. Then it sounds
like you want a partial btree index of (group_id, hash) where the
flag is set.

#10Peter Fein
pfein@pobox.com
In reply to: Bruno Wolff III (#9)
Re: Hash Function: MD5 or other?

Bruno Wolff III wrote:

On Tue, Jun 14, 2005 at 08:33:34 -0500,
Peter Fein <pfein@pobox.com> wrote:

Knowing the specifics of the data I'm putting in sometext, a halfway
decent hash function would make collisions so rare as to make the chance
insignificant (and collisions wouldn't break anything anyway). Is this
approach reasonable, or should I use a hash index on (group_id,
sometext) - does this suffer from the same size limitation as btrees? I
thought hash indexes were slow...

The hash value should be saved as a separate column. Then it sounds
like you want a partial btree index of (group_id, hash) where the
flag is set.

I'm unclear why I'd need to store the hash in a column. I suppose I
could have the hash column populated by a trigger on inserts, but that
seems to get me the same functionality & is less obvious.

For the archives, I did:

CREATE UNIQUE INDEX idx_md5_sometext ON mytable USING btree
(group_id, md5(sometext))
WHERE group_representative = true;

I then basically replicate this in a SELECT on the client side
(including calculating the MD5 by the client) to figure out the correct
value for group_representative before inserting a new row. This is the
only way I use the MD5, so I don't much care about retrieving it in
other contexts.
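For anyone replicating this client-side: PostgreSQL's md5() returns the digest as 32 lowercase hex characters, so the client computation has to produce the same encoding (and, assuming a UTF-8 database, hash the same bytes). A minimal sketch in Python; the function name here is made up:

```python
import hashlib

def pg_md5(sometext: str) -> str:
    """Mirror PostgreSQL's md5(text): the MD5 digest of the string's
    (UTF-8) bytes, rendered as 32 lowercase hex characters."""
    return hashlib.md5(sometext.encode("utf-8")).hexdigest()

digest = pg_md5("hello")  # same value SELECT md5('hello') returns
```

The resulting digest can then be compared against the indexed md5(sometext) expression when deciding group_representative for a new row.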

--
Peter Fein pfein@pobox.com 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman

#11Bruno Wolff III
bruno@wolff.to
In reply to: Peter Fein (#10)
Re: Hash Function: MD5 or other?

On Tue, Jun 14, 2005 at 15:54:50 -0500,
Peter Fein <pfein@pobox.com> wrote:

I'm unclear why I'd need to store the hash in a column. I suppose I
could have the hash column populated by a trigger on inserts, but this
seems to get me the same functionality & is less obvious.

For the archives, I did:

CREATE UNIQUE INDEX idx_md5_sometext ON mytable USING btree
(group_id, md5(sometext))
WHERE group_representative = true;

I then basically replicate this in a SELECT on the client side
(including calculating the MD5 by the client) to figure out the correct
value for group_representative before inserting a new row. This is the
only way I use the MD5, so I don't much care about retrieving it in
other contexts.

That should work fine.

I wasn't sure whether you were going to want to use the hash for joins.
And I was a little concerned that, because you used the phrase "hash
index", you might be considering using a hash (as opposed to btree) index.