Hash Indexes

Started by Naz Gassiepover 18 years ago7 messagesgeneral
Jump to latest
#1Naz Gassiep
naz@mira.net

Hi there,
I am creating functionality where there are blocks of text that are
being stored in the DB and that need to be searched for. No like or
pattern matching, just a plain old WHERE clause. Obviously, hash indexes
would be best here, however I've been warned away from PG's hash
implementation. Would it be better to manually create a column storing
the hashes (maintained with a CHECK (md5(textcol) = hashcol) type
constraint) and generate the hashes in the app?

- Naz.

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: Naz Gassiep (#1)
Re: Hash Indexes

On Sat, Jan 05, 2008 at 03:20:54AM +1100, Naz Gassiep wrote:

Hi there,
I am creating functionality where there are blocks of text that are
being stored in the DB and that need to be searched for. No like or
pattern matching, just a plain old WHERE clause. Obviously, hash indexes
would be best here, however I've been warned away from PG's hash
implementation.

Why are hash indexes "obviously" best? In an ideal world with a good
implementation maybe, but postgresql b-trees are really quite good.

Would it be better to manually create a column storing
the hashes (maintained with a CHECK (md5(textcol) = hashcol) type
constraint) and generate the hashes in the app?

You could always do something like:

CREATE INDEX foo ON table((md5(textcol)));

Then it will get used in queries like:
SELECT * FROM table WHERE md5(textcol) = md5('text');

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

#3Naz Gassiep
naz@mira.net
In reply to: Martijn van Oosterhout (#2)
Re: Hash Indexes

Why are hash indexes "obviously" best? In an ideal world with a good
implementation maybe, but postgresql b-trees are really quite good.

Because doing normal queries on a table where there are large text
blocks is unlikely to be a good idea. E.g.,:

SELECT * FROM table WHERE textcol = 'a 4kb block of text';

You could always do something like:

CREATE INDEX foo ON table((md5(textcol)));

Then it will get used in queries like:
SELECT * FROM table WHERE md5(textcol) = md5('text');

That's exactly what I was considering doing, however there is always the
change of a hash collision. Yes, this is a very remote chance, however
the ramifications of a collision under those circumstances is
potentially catastrophic. Think a user being delivered text that
contains confidential and sensitive material as opposed to the latest
memo about the cleaning of toilets.

I would assume that hash indexes have inbuilt mechanisms for collision
checking before returning the row as a match. Am I correct in this
assumption?

Best regards,
- Naz.

#4Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Naz Gassiep (#3)
Re: Hash Indexes

On Tue, Jan 08, 2008 at 01:49:53AM +1100, Naz Gassiep wrote:

Because doing normal queries on a table where there are large text
blocks is unlikely to be a good idea. E.g.,:

SELECT * FROM table WHERE textcol = 'a 4kb block of text';

I suggest you look at the tsearch stuff instead.

I would assume that hash indexes have inbuilt mechanisms for collision
checking before returning the row as a match. Am I correct in this
assumption?

I think you should avoid any assumptions about the hash index implementation
in PostgreSQL. The general consensus seems to be that the code has a number
of problems. Most importantly, hash index operations are _not_ currently
WAL-logged, which means you probably need to REINDEX in the event of a
database crash. I don't know whether the collision issues are present in
hash indexes, though.

A

#5Martijn van Oosterhout
kleptog@svana.org
In reply to: Naz Gassiep (#3)
Re: Hash Indexes

On Tue, Jan 08, 2008 at 01:49:53AM +1100, Naz Gassiep wrote:

I would assume that hash indexes have inbuilt mechanisms for collision
checking before returning the row as a match. Am I correct in this
assumption?

Well, they do in the sense that it does the equivalent of:

SELECT * FROM table WHERE md5(textcol) = md5('text') and textcol = 'text';

But hash indexes can't be used for uniqueness checks, so if you need to
stop the same value being entered twice, hash indexes are way out.

The fact that hash indexes don't survive a DB crash is probably your
biggest concern.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

#6Sam Mason
sam@samason.me.uk
In reply to: Naz Gassiep (#3)
Re: Hash Indexes

On Tue, Jan 08, 2008 at 01:49:53AM +1100, Naz Gassiep wrote:

You could always do something like:

CREATE INDEX foo ON table((md5(textcol)));

Then it will get used in queries like:
SELECT * FROM table WHERE md5(textcol) = md5('text');

That's exactly what I was considering doing, however there is always the
change of a hash collision. Yes, this is a very remote chance, however
the ramifications of a collision under those circumstances is
potentially catastrophic.

You could make it a UNIQUE index (i.e. CREATE UNIQUE INDEX and the rest
like above) if you wanted, or you could perform the query as:

SELECT * FROM table
WHERE md5(textcol) = md5('text')
AND textcol = 'text';

this should use the index to do the initial lookup and then filter out
colliding entries.

I would assume that hash indexes have inbuilt mechanisms for collision
checking before returning the row as a match. Am I correct in this
assumption?

The above isn't using hash indexes in any way. You're creating a b-tree
index on top of the md5-hash of a column. The only index type that
support uniqueness constraints at the moment are b-tree indexes[1]http://www.postgresql.org/docs/current/static/sql-createindex.html#AEN47593.

Sam

[1]: http://www.postgresql.org/docs/current/static/sql-createindex.html#AEN47593

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Naz Gassiep (#3)
Re: Hash Indexes

Naz Gassiep <naz@mira.net> writes:

Why are hash indexes "obviously" best? In an ideal world with a good
implementation maybe, but postgresql b-trees are really quite good.

Because doing normal queries on a table where there are large text
blocks is unlikely to be a good idea. E.g.,:
SELECT * FROM table WHERE textcol = 'a 4kb block of text';

You seem to be harboring some rather severe conceptual errors about
how hash indexes work, or at least how Postgres' hash indexes work.
I get the impression you think that a hash index stores only a hash
code and not the actual field value, but that's not so.

regards, tom lane