todo: Hash index creation

Started by Nonameover 18 years ago10 messages
#1Noname
twraney@comcast.net

Is anyone currently working on this TODO item?

"During index creation, pre-sort the tuples to improve build speed"
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php

A few of us would like to tackle it and see if we can add some value here.

Tom,
Shreya

#2Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Noname (#1)
Re: todo: Hash index creation

twraney@comcast.net wrote:

Is anyone currently working on this TODO item?

"During index creation, pre-sort the tuples to improve build speed"
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php

A few of us would like to tackle it and see if we can add some value here.

That would be nice!

If you want to work on hash indexes, though, this TODO item seems more
important to me at least:

Add WAL logging for crash recovery

Until that's done, you can't really use hash indexes for anything else
than read-only tables.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#2)
Re: todo: Hash index creation

Heikki Linnakangas <heikki@enterprisedb.com> writes:

twraney@comcast.net wrote:

Is anyone currently working on this TODO item?
"During index creation, pre-sort the tuples to improve build speed"

If you want to work on hash indexes, though, this TODO item seems more
important to me at least:

Add WAL logging for crash recovery

Actually I think the *most* important thing to work on is to get hash to
the point where its search speed actually beats btree consistently, so
that it has an excuse to live. If that is insoluble we might well end up
ripping it out entirely. (The first three TODO items for hash indexes
are ideas for trying to improve the speed.)

Fixing the WAL support would come after that, and bring it to the point
where someone could actually recommend it for production use.

After that it would be sensible to work on inessentials like improving
the build speed.

regards, tom lane

#4Kenneth Marshall
ktm@rice.edu
In reply to: Tom Lane (#3)
Re: todo: Hash index creation

On Wed, Jun 27, 2007 at 08:36:54PM -0400, Tom Lane wrote:

Heikki Linnakangas <heikki@enterprisedb.com> writes:

twraney@comcast.net wrote:

Is anyone currently working on this TODO item?
"During index creation, pre-sort the tuples to improve build speed"

If you want to work on hash indexes, though, this TODO item seems more
important to me at least:

Add WAL logging for crash recovery

Actually I think the *most* important thing to work on is to get hash to
the point where its search speed actually beats btree consistently, so
that it has an excuse to live. If that is insoluble we might well end up
ripping it out entirely. (The first three TODO items for hash indexes
are ideas for trying to improve the speed.)

Fixing the WAL support would come after that, and bring it to the point
where someone could actually recommend it for production use.

After that it would be sensible to work on inessentials like improving
the build speed.

regards, tom lane

I definitely agree with Tom's assessment. If we cannot need to make the
hash index as performant as it is in theory, none of the other refinements
are worth it. You would need to use BTree if you were concerned about
speed. (and who isn't)

Ken

#5Naz Gassiep
naz@mira.net
In reply to: Tom Lane (#3)
Re: todo: Hash index creation

Actually I think the *most* important thing to work on is to get hash to
the point where its search speed actually beats btree consistently, so
that it has an excuse to live. If that is insoluble we might well end up
ripping it out entirely. (The first three TODO items for hash indexes
are ideas for trying to improve the speed.)

Fixing the WAL support would come after that, and bring it to the point
where someone could actually recommend it for production use.

After that it would be sensible to work on inessentials like improving
the build speed.

I've been warned away from hash indexes before, however I had no idea
that it's performance was that abysmal that BTREE beat it and I was
definitely not aware that they were not included in WAL logs. I was told
it wasn't as good as it could be, but I wasn't told it was pretty much
an alpha piece of code.

As a result, when creating tables containing large blocks of text I wish
to index, I've been using HASH as an index method. Please can we state
in the manual that HASH index types are in a beta stage of development
or something similar, or perhaps remove the manual entry altogether
until HASH is at a point where it is usable in production.

Regards,
A very surprised n00b.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Naz Gassiep (#5)
Re: todo: Hash index creation

Naz Gassiep <naz@mira.net> writes:

As a result, when creating tables containing large blocks of text I wish
to index, I've been using HASH as an index method. Please can we state
in the manual that HASH index types are in a beta stage of development
or something similar, or perhaps remove the manual entry altogether
until HASH is at a point where it is usable in production.

Uh, the manual already does say

Note: Testing has shown PostgreSQL's hash indexes to perform no better
than B-tree indexes, and the index size and build time for hash indexes
is much worse. Furthermore, hash index operations are not presently
WAL-logged, so hash indexes might need to be rebuilt with REINDEX after
a database crash. For these reasons, hash index use is presently
discouraged.

under 11.2 Index Types, as well as various derogatory remarks elsewhere.

regards, tom lane

#7Hannu Krosing
hannu@skype.net
In reply to: Naz Gassiep (#5)
Re: todo: Hash index creation

Ühel kenal päeval, E, 2007-07-02 kell 04:27, kirjutas Naz Gassiep:

I've been warned away from hash indexes before, however I had no idea
that it's performance was that abysmal that BTREE beat it and I was
definitely not aware that they were not included in WAL logs. I was told
it wasn't as good as it could be, but I wasn't told it was pretty much
an alpha piece of code.

As a result, when creating tables containing large blocks of text I wish
to index, I've been using HASH as an index method.

If you just wish to have smaller indexes, then you can use functional
btree indexes over text hash, like this:

CREATE INDEX largetextindex on mytable(hashtext(largetext));

and use

SELECT * FROM mytable
where hashtext(largetext) = hastext('searchvalue')
and largetext = 'searchvalue'
;

btw, if the real hash indexes don't get fixes soon, maybe we could
redefine hash index to actually mean usage like this and do the rewrites
in parser?

Show quoted text

Please can we state
in the manual that HASH index types are in a beta stage of development
or something similar, or perhaps remove the manual entry altogether
until HASH is at a point where it is usable in production.

Regards,
A very surprised n00b.

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

#8Naz Gassiep
naz@mira.net
In reply to: Tom Lane (#6)
Re: todo: Hash index creation

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
Wow... not sure how I missed that. I *did* create this schema ages ago,
perhaps it wasn't there, or at the time I had no idea what the
implications were. *shrug*<br>
Regards,<br>
- Naz.<br>
<br>
<br>
Tom Lane wrote:
<blockquote cite="mid:20662.1183344247@sss.pgh.pa.us" type="cite">
<pre wrap="">Naz Gassiep <a class="moz-txt-link-rfc2396E" href="mailto:naz@mira.net">&lt;naz@mira.net&gt;</a> writes:
</pre>
<blockquote type="cite">
<pre wrap="">As a result, when creating tables containing large blocks of text I wish
to index, I've been using HASH as an index method. Please can we state
in the manual that HASH index types are in a beta stage of development
or something similar, or perhaps remove the manual entry altogether
until HASH is at a point where it is usable in production.
</pre>
</blockquote>
<pre wrap=""><!---->
Uh, the manual already does say

Note: Testing has shown PostgreSQL's hash indexes to perform no better
than B-tree indexes, and the index size and build time for hash indexes
is much worse. Furthermore, hash index operations are not presently
WAL-logged, so hash indexes might need to be rebuilt with REINDEX after
a database crash. For these reasons, hash index use is presently
discouraged.

under 11.2 Index Types, as well as various derogatory remarks elsewhere.

regards, tom lane

</pre>
</blockquote>
</body>
</html>

#9Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Kenneth Marshall (#4)
Re: todo: Hash index creation

Kenneth Marshall wrote:

I definitely agree with Tom's assessment. If we cannot need to make the
hash index as performant as it is in theory, none of the other refinements
are worth it. You would need to use BTree if you were concerned about
speed. (and who isn't)

I just got an idea.

Hash indexes would take much less space, and be more efficient to
search, if we only stored the hash of the key in the index. Such index
tuples would be fixed size, so we could get rid of the overhead of the
length-field in IndexTuple, as well as the line pointers.

Of course, the key value would need to be rechecked after fetching the
heap tuple, but that shouldn't be a problem assuming there's few collisions.

Another idea: when searching, we scan the whole bucket page looking for
matching tuples. If we stored the tuples ordered by hash value within
page, we could do a binary search instead.

These changes might give hash indexam the edge over b-tree it needs.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In reply to: Heikki Linnakangas (#9)
Re: todo: Hash index creation

On Thu, Jul 05, 2007 at 12:26:45PM +0100, Heikki Linnakangas wrote:

Kenneth Marshall wrote:

I definitely agree with Tom's assessment. If we cannot need to make the
hash index as performant as it is in theory, none of the other refinements
are worth it. You would need to use BTree if you were concerned about
speed. (and who isn't)

I just got an idea.

Hash indexes would take much less space, and be more efficient to
search, if we only stored the hash of the key in the index. Such index
tuples would be fixed size, so we could get rid of the overhead of the
length-field in IndexTuple, as well as the line pointers.

Of course, the key value would need to be rechecked after fetching the
heap tuple, but that shouldn't be a problem assuming there's few collisions.

Another idea: when searching, we scan the whole bucket page looking for
matching tuples. If we stored the tuples ordered by hash value within
page, we could do a binary search instead.

These changes might give hash indexam the edge over b-tree it needs.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

I was thinking about your idea on my commute yesterday and the new
HOT functionality. The first step would be as you suggested, store
only the hash value and the heap bucket page in the index. The check
for collisions within a hash bucket is essentially the same thing
that we do when searching for the heap tuple visible for a specific
transaction. This would mean, that with a HOT type process, the index
would not need to be updated when a non-indexed field is changed. We
treat them as another tuple in the HOT chain.

This would make typical lookups require 2 disk reads per item, the
first to get the heap location and the second to get the heap page
itself. The next step would be to have the hash function generate
the heap location directly instead of a disk read, then retrieving
an item would be one read. This would also provide for a simple
clustering based on the hash index. I am certain that there are
holes in this ideal, but it would be a logical next step.

Regards,
Ken