Hash index todo list item
Dear PostgreSQL Hackers:
After following the hackers mailing list for quite a while,
I am going to start investigating what will need to be done
to improve hash index performance. Below are the pieces of
this project that I am currently considering:
1. Characterize the current hash index implementation against
the BTree index, with a focus on space utilization and
lookup performance against a collection of test data. This
will give a baseline performance test to evaluate the impact
of changes. I initially do not plan to bench the hash creation
process since my initial focus will be on lookup performance.
2. Evaluate the performance of different hash index implementations
and/or changes to the current implementation. My current plan is
to keep the implementation as simple as possible and still provide
the desired performance. Several hash index suggestions deal with
changing the layout of the keys on a page to improve lookup
performance, including reducing the bucket size to a fraction of
a page or only storing the hash value on the page, instead of
the index value itself. My goal in this phase is to produce one
or more versions with better performance than the current BTree.
3. Look at build time and concurrency issues with the addition of
some additional tests to the test bed. (1)
4. Repeat as needed.
This is the rough plan. Does anyone see anything critical that
is missing at this point? Please send me any suggestions for test
data and various performance test ideas, since I will be working
on that first.
Regards,
Ken Marshall
Kenneth Marshall <ktm@rice.edu> writes:
... This is the rough plan. Does anyone see anything critical that
is missing at this point?
Sounds pretty good. Let me brain-dump one item on you: one thing that
hash currently has over btree is the ability to handle index items up
to a full page. Now, if you go with a scheme that only stores hash
codes and not the underlying data, you can not only handle that but
improve on it; but if you reduce the bucket size and don't remove the
data, it'd be a step backward. The idea I had about dealing with that
was to only reduce the size of primary buckets --- if it's necessary to
add overflow space to a bucket, the overflow units are still full pages.
So an index tuple up to a page in size can always be accommodated by
adding an overflow page to the bucket.
Just a thought, but AFAIR it's not in the archives anywhere.
regards, tom lane
On Sun, 2007-09-02 at 13:04 -0500, Kenneth Marshall wrote:
Dear PostgreSQL Hackers:
After following the hackers mailing list for quite a while,
I am going to start investigating what will need to be done
to improve hash index performance. Below are the pieces of
this project that I am currently considering:1. Characterize the current hash index implementation against
the BTree index, with a focus on space utilization and
lookup performance against a collection of test data. This
will give a baseline performance test to evaluate the impact
of changes. I initially do not plan to bench the hash creation
process since my initial focus will be on lookup performance.2. Evaluate the performance of different hash index implementations
and/or changes to the current implementation. My current plan is
to keep the implementation as simple as possible and still provide
the desired performance. Several hash index suggestions deal with
changing the layout of the keys on a page to improve lookup
performance, including reducing the bucket size to a fraction of
a page or only storing the hash value on the page, instead of
the index value itself. My goal in this phase is to produce one
or more versions with better performance than the current BTree.3. Look at build time and concurrency issues with the addition of
some additional tests to the test bed. (1)4. Repeat as needed.
This is the rough plan. Does anyone see anything critical that
is missing at this point? Please send me any suggestions for test
data and various performance test ideas, since I will be working
on that first.
Sounds good.
I'd be particularly interested in large indexes, say ~ 0.5 - 2GB. There
are likely to be various effects apparent as the indexes grow. It would
be too easy to do all the tests with smaller indexes and miss things.
Other factors are:
- volatility
- concurrency
My general experience is that hash-based indexes are better when the
range of inputs is relatively well-known, allowing a fast lookup. If
that is the only benefit of hash indexes, a flexible hashing scheme may
simply weaken the benefit-case for using them. If that's true, should
the index build process examine the key values in the data to determine
the best parameters to use? Kind of ANALYZE before build.
My current feeling is that they ought to be very good at handling
read-mostly situations such as privilege checking or UPDATE-intensive
situations such as Customer-Current-Credit tracking, when the number of
customers is large.
It might also be worth looking at lossy hash indexes, i.e. the index
stores only the block numbers. That would need to be part of the
discussion around how lossy we will allow indexes to be.
We currently have two kinds of full text index with different
concurrency use cases, so it should be acceptable to have hash indexes
have a clear benefit in one use case but a clear loss in another.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
Kenneth Marshall <ktm@rice.edu> writes:
... This is the rough plan. Does anyone see anything critical that
is missing at this point?Sounds pretty good. Let me brain-dump one item on you: one thing that
hash currently has over btree is the ability to handle index items up
to a full page. Now, if you go with a scheme that only stores hash
codes and not the underlying data, you can not only handle that but
improve on it; but if you reduce the bucket size and don't remove the
data, it'd be a step backward. The idea I had about dealing with that
was to only reduce the size of primary buckets --- if it's necessary to
add overflow space to a bucket, the overflow units are still full pages.
So an index tuple up to a page in size can always be accommodated by
adding an overflow page to the bucket.Just a thought, but AFAIR it's not in the archives anywhere.
regards, tom lane
Tom,
Thank you for the input. I agree that keeping the ability to accomodate
an index tuple up to a page is size worth keeping. I think that your
goal in reducing the bucket size is to improve the packing efficiency
of the index. Since the on disk page size remains the same, it may be
possible to use a different structure overlayed on the current bucket
size and still improve the packing efficiency of the index. After some
more mulling, here are some further thoughts on the improved hash table
implementation:
- Hash lookup is O(1) while btree is O(logN). Is there a value
in optimizing the NOT case, i.e. the entry is not in the table?
- Space versus performance trade-off. This may tie into cache
efficiency and use of L2/L3, shared buffers, main memory.
Denser layouts with a higher load facter may be a bit slower
in lookups but play much nicer in a multi-user system. Look
at the possibility of a lossy mapping?
- Build time versus update time. How does concurrency enter into
the picture regarding simultaneous updates, inserts, deletes,
and lookups?
- Could a hybrid structure with some type of prefix compression
give a more efficient layout and possibly improve performance?
- Index larger fields. Btree is limited to blocksize/3, the
current hash implementation can go up to a full block.
- What about multi-column indexes? The current implementation
only supports 1 column.
More ideas are welcome and I will add them to the list for
investigation.
Regards,
Ken
On Mon, Sep 03, 2007 at 10:33:54AM +0100, Simon Riggs wrote:
This is the rough plan. Does anyone see anything critical that
is missing at this point? Please send me any suggestions for test
data and various performance test ideas, since I will be working
on that first.Sounds good.
I'd be particularly interested in large indexes, say ~ 0.5 - 2GB. There
are likely to be various effects apparent as the indexes grow. It would
be too easy to do all the tests with smaller indexes and miss things.Other factors are:
- volatility
- concurrencyMy general experience is that hash-based indexes are better when the
range of inputs is relatively well-known, allowing a fast lookup. If
that is the only benefit of hash indexes, a flexible hashing scheme may
simply weaken the benefit-case for using them. If that's true, should
the index build process examine the key values in the data to determine
the best parameters to use? Kind of ANALYZE before build.My current feeling is that they ought to be very good at handling
read-mostly situations such as privilege checking or UPDATE-intensive
situations such as Customer-Current-Credit tracking, when the number of
customers is large.It might also be worth looking at lossy hash indexes, i.e. the index
stores only the block numbers. That would need to be part of the
discussion around how lossy we will allow indexes to be.We currently have two kinds of full text index with different
concurrency use cases, so it should be acceptable to have hash indexes
have a clear benefit in one use case but a clear loss in another.--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon,
Thank you for your input. I would like to include some tests with large
indexes too. Do you have any ideas for a test corpus or should we try
and generate the test data programatically? Many people in the literature
of text retrieval use the TREC* data for at least some of their runs. I
am going to check at work to see if the campus has access to the data,
otherwise I will do some web crawling to generate some sample data. I
have just posted a reply to Tom Lane with some further ideas for consideration
in the new hash index support. Like you, I suspect that volatile data that
results in many index changes may not work well with hash indexes, in general.
PostgreSQL has the additional burden of needing to access both the index and
the data heap. Obviously, the less I/O that is needed the better the
performance is likely to be. The new HOT functionality plus clustering the
table data on the hash index would effectively organize the table into the
"hash buckets" which could help with reducing both the churn in the index
as well as in the tables.
Regards,
Ken
"Kenneth Marshall" <ktm@rice.edu> writes:
On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
Kenneth Marshall <ktm@rice.edu> writes:
... This is the rough plan. Does anyone see anything critical that
is missing at this point?Sounds pretty good. Let me brain-dump one item on you: one thing that
hash currently has over btree is the ability to handle index items up
to a full page. Now, if you go with a scheme that only stores hash
codes and not the underlying data, you can not only handle that but
improve on it;
I think that would be a big selling point for hash indexes. It would let you
index even toasted data which are larger than a page. I'm not sure whether you
can make it work for unique indexes though. But for non-unique indexes I think
it would be a solid win and mean you cover a set of use cases quite distinct
from btree indexes.
- Hash lookup is O(1) while btree is O(logN).
That's not really true. There's a tradeoff between insertion time and lookup
time. In order to get O(1) lookups you need to work pretty hard to maintain
the hash table including spending a lot of time reorganizing it when you grow
it. If you don't want to spend the time on inserts then you end up with
buckets and the hash table is basically just a linear speedup to whatever
algorithm you use to scan the buckets.
- What about multi-column indexes? The current implementation
only supports 1 column.
That seems kind of weird. It seems obvious that you mix the three hashes
together which reduces it to the solved problem.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
"Kenneth Marshall" <ktm@rice.edu> writes:
- What about multi-column indexes? The current implementation
only supports 1 column.
That seems kind of weird. It seems obvious that you mix the three hashes
together which reduces it to the solved problem.
No, because part of the deal is that you can do lookups using only the
leading index columns. At least, all the existing multicolumn index
types can do that.
regards, tom lane
On 9/3/07, Gregory Stark <stark@enterprisedb.com> wrote:
"Kenneth Marshall" <ktm@rice.edu> writes:
On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
Kenneth Marshall <ktm@rice.edu> writes:
... This is the rough plan. Does anyone see anything critical that
is missing at this point?Sounds pretty good. Let me brain-dump one item on you: one thing that
hash currently has over btree is the ability to handle index items up
to a full page. Now, if you go with a scheme that only stores hash
codes and not the underlying data, you can not only handle that but
improve on it;I think that would be a big selling point for hash indexes. It would let you
index even toasted data which are larger than a page. I'm not sure whether you
can make it work for unique indexes though. But for non-unique indexes I think
it would be a solid win and mean you cover a set of use cases quite distinct
from btree indexes.- Hash lookup is O(1) while btree is O(logN).
That's not really true. There's a tradeoff between insertion time and lookup
time. In order to get O(1) lookups you need to work pretty hard to maintain
the hash table including spending a lot of time reorganizing it when you grow
it. If you don't want to spend the time on inserts then you end up with
buckets and the hash table is basically just a linear speedup to whatever
algorithm you use to scan the buckets.
These facts notwithstanding, average insert performance remains O(1)
if you grow the hash exponentially every time it needs to be grown.
Suppose, for example, that you use a power of 2 arrangement. Then the
worst case scenario, right after a split, is that all of your keys had
to be inserted, all had to be moved once, half had to be moved twice,
a quarter 3 times, etc. So the ratio of moves to keys is 1 + 1/2 +
1/4 + ... which is a well-known geometric series converging on 2.
True, when you cross the threshold a lot of work needs to be done.
Life would be simpler if you could just put up a lock while you split
the hash. You can't do that for a busy transactional database though.
But if you want to be clever about it, you build into your hash
implementation the intelligence to be able to have 1 or 2 hash
locations to search. When they are both present, all inserts go into
one of them, all deletes and updates are performed against both. Then
you're able to have a background job reorganize your hash while the
database continues to use it.
- What about multi-column indexes? The current implementation
only supports 1 column.That seems kind of weird. It seems obvious that you mix the three hashes
together which reduces it to the solved problem.
That raises a very random thought. One of the nicer features of
Oracle is the ability to have function-based indexes. So you could
index, say, trim(lower(person.name)). There are a *lot* of practical
situations where that comes in handy. The best workaround that I can
think of for not having that is to have a column defined to hold the
result of the function, maintain that column with a trigger, then
index that column. Which works, but is inelegant. (It also requires
storing completely redundant data.)
Is there any prospect of postgres aquiring that functionality?
Ben
On Mon, Sep 03, 2007 at 05:20:34PM -0700, Ben Tilly wrote:
That raises a very random thought. One of the nicer features of
Oracle is the ability to have function-based indexes. So you could
index, say, trim(lower(person.name)). There are a *lot* of practical
situations where that comes in handy. The best workaround that I can
think of for not having that is to have a column defined to hold the
result of the function, maintain that column with a trigger, then
index that column. Which works, but is inelegant. (It also requires
storing completely redundant data.)Is there any prospect of postgres aquiring that functionality?
Ben
I believe that PostgreSQL already supports functional indexes. In fact,
one suggestion to address the egregiously poor performance of the current
hash index was to replace it with a functional index.
Regards,
Ken
"Ben Tilly" <btilly@gmail.com> writes:
That raises a very random thought. One of the nicer features of
Oracle is the ability to have function-based indexes. So you could
index, say, trim(lower(person.name)).
Is there any prospect of postgres aquiring that functionality?
Uh, no, since it's already there; has been since Berkeley days ...
regards, tom lane
On 9/3/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Ben Tilly" <btilly@gmail.com> writes:
That raises a very random thought. One of the nicer features of
Oracle is the ability to have function-based indexes. So you could
index, say, trim(lower(person.name)).Is there any prospect of postgres aquiring that functionality?
Uh, no, since it's already there; has been since Berkeley days ...
Nice!
I know of at least one DBA who is moving from Oracle to postgres who
will be *very* happy to hear that.
Ben
On Sun, Sep 02, 2007 at 01:04:04PM -0500, Kenneth Marshall wrote:
Dear PostgreSQL Hackers:
After following the hackers mailing list for quite a while,
I am going to start investigating what will need to be done
to improve hash index performance. Below are the pieces of
this project that I am currently considering:1. Characterize the current hash index implementation against
the BTree index, with a focus on space utilization and
lookup performance against a collection of test data. This
will give a baseline performance test to evaluate the impact
of changes. I initially do not plan to bench the hash creation
process since my initial focus will be on lookup performance.
Here are very basic results for a table with 1.6m entries:
postgres=# CREATE TABLE dict (word varchar(100));
CREATE TABLE
postgres=# COPY dict FROM '/tmp/words';
COPY 1648379
postgres=# select count(*) from dict;
count
---------
1648379
(1 row)
Time: 11187.418 ms
postgres=# select count(*) from dict;
count
---------
1648379
(1 row)
Time: 6040.912 ms
postgres=# CREATE INDEX wordhash ON dict USING hash (word);
CREATE INDEX
Time: 11108707.160 ms
postgres=# select * from dict where word = 'avatar';
word
--------
avatar
(1 row)
Time: 79.823 ms
postgres=# select * from dict where word = 'zebra';
word
-------
zebra
(1 row)
Time: 9.864 ms
postgres=# select * from dict where word = 'turkey';
word
--------
turkey
(1 row)
Time: 18.418 ms
Time: 1.045 ms
Time: 1.257 ms
Time: 1.080 ms
postgres=# CREATE INDEX wordbtree ON dict USING btree (word);
CREATE INDEX
Time: 25438.884 ms
postgres=# select * from dict where word = 'avatar';
word
--------
avatar
(1 row)
Time: 13.400 ms
postgres=# select * from dict where word = 'zebra';
word
-------
zebra
(1 row)
Time: 1.173 ms
postgres=# select * from dict where word = 'turkey';
word
--------
turkey
(1 row)
Time: 1.186 ms
Time: 1.103 ms
Time: 1.099 ms
Time: 1.108 ms
------------------------------
Size of table = 87556096
Size of hash index = 268451840
Size of btree index = 53510144
From my very small sample on an unloaded machine, a hash index lookup
took the least amount of time. It had a much larger initial time which
could be attributable to cache population effects. The size is 5X that
of the Btree index. I will continue to improve the test suite as more
granularity is needed. If anyone has a good data generator, please let
me know. Otherwise I will just roll my own.
Regards,
Ken
Ühel kenal päeval, E, 2007-09-03 kell 19:55, kirjutas Tom Lane:
Gregory Stark <stark@enterprisedb.com> writes:
"Kenneth Marshall" <ktm@rice.edu> writes:
- What about multi-column indexes? The current implementation
only supports 1 column.That seems kind of weird. It seems obvious that you mix the three hashes
together which reduces it to the solved problem.No, because part of the deal is that you can do lookups using only the
leading index columns. At least, all the existing multicolumn index
types can do that.
One approahc is not to mix hashes, but to partition the hash, so that
each column gets its N bits in the hash.
If you do it smartly you can use any column for index lookups, not just
the leading one.
Show quoted text
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
Hannu Krosing <hannu@skype.net> writes:
Ühel kenal päeval, E, 2007-09-03 kell 19:55, kirjutas Tom Lane:
No, because part of the deal is that you can do lookups using only the
leading index columns. At least, all the existing multicolumn index
types can do that.
One approahc is not to mix hashes, but to partition the hash, so that
each column gets its N bits in the hash.
How does that help? You still need all the keys to find out which
bucket to look in.
regards, tom lane
Ühel kenal päeval, N, 2007-09-06 kell 09:38, kirjutas Tom Lane:
Hannu Krosing <hannu@skype.net> writes:
Ühel kenal päeval, E, 2007-09-03 kell 19:55, kirjutas Tom Lane:
No, because part of the deal is that you can do lookups using only the
leading index columns. At least, all the existing multicolumn index
types can do that.One approahc is not to mix hashes, but to partition the hash, so that
each column gets its N bits in the hash.How does that help? You still need all the keys to find out which
bucket to look in.
no. you need to look at only the buckets where that part of hash matches
say you allocate bits 4-7 for column 2 and then need to look up column 2
value with hash 3 . here you need to look at only buckets N*16 + 3, that
is, you need to examine only each 16th bucket
---------
Hannu
Hannu Krosing wrote:
One approahc is not to mix hashes, but to partition the hash, so that
each column gets its N bits in the hash.How does that help? You still need all the keys to find out which
bucket to look in.no. you need to look at only the buckets where that part of hash matches
say you allocate bits 4-7 for column 2 and then need to look up column 2
value with hash 3 . here you need to look at only buckets N*16 + 3, that
is, you need to examine only each 16th bucket
I don't like the truncating hash suggestion because it limits the
ability of a hash code to uniquely identify a key.
If a user requires the ability to search on both (column1) and (column1,
column2), they can create two hash indexes and the planner can decide
which to use.
Or, they can use a btree. I think hash has a subset of uses where it
would be a significant gain, and focus should be spent on this subset.
Cheers,
mark
--
Mark Mielke <mark@mielke.cc>
On Sep 6, 2007, at 10:53 , Mark Mielke wrote:
I don't like the truncating hash suggestion because it limits the
ability of a hash code to uniquely identify a key.
AIUI, a hash can't be used as a unique identifier: it always needs to
be rechecked due to the chance of collisions. There might be other
issues with truncation, but preventing hashes from being unique isn't
one of them.
Michael Glaesemann
grzm seespotcode net
On Thu, Sep 06, 2007 at 11:53:45AM -0400, Mark Mielke wrote:
Hannu Krosing wrote:
One approahc is not to mix hashes, but to partition the hash, so that
each column gets its N bits in the hash.How does that help? You still need all the keys to find out which
bucket to look in.no. you need to look at only the buckets where that part of hash matches
say you allocate bits 4-7 for column 2 and then need to look up column 2
value with hash 3 . here you need to look at only buckets N*16 + 3, that
is, you need to examine only each 16th bucketI don't like the truncating hash suggestion because it limits the ability
of a hash code to uniquely identify a key.If a user requires the ability to search on both (column1) and (column1,
column2), they can create two hash indexes and the planner can decide which
to use.
Or, they can use a btree. I think hash has a subset of uses where it would
be a significant gain, and focus should be spent on this subset.Cheers,
mark
I agree that we should focus primarily on the subset of uses for hash
indexes where there would be a significant gain. I do think that being
able to use a single O(1) hash lookup against all the values specified
in a pseudo-multi-column index could be very beneficial in reducing
access time and I/O.
Since we already have to check the actual tuple values for any index
lookup in postgresql, we could only store the full hash value and the
corresponding TIDs in the bucket. Then when we lookup an item by
calculating its hash, if the exact hash is not present in the bucket,
then we know that the item is not in the index. If the value exists,
then we would check the heap tuples before returning the results. Thus
a negative lookup only needs to check the index and if the hash function
is "good" there will be optimally only 1 possibly valid heap tuple if
there is a match. One very big win for this change is to allow a much
smaller index size (hash value + relevant TIDs) and the large column
values are only stored in the actual data tuples.
Regards,
Ken
Show quoted text
--
Mark Mielke <mark@mielke.cc>
Michael Glaesemann wrote:
On Sep 6, 2007, at 10:53 , Mark Mielke wrote:
I don't like the truncating hash suggestion because it limits the
ability of a hash code to uniquely identify a key.AIUI, a hash can't be used as a unique identifier: it always needs to
be rechecked due to the chance of collisions. There might be other
issues with truncation, but preventing hashes from being unique isn't
one of them.
Of course - that's why I used the word "limit".
Hash works best, when the key is unique, however. A 32-bit hash will be
many powers of 2 more unique than a 8-bit hash.
Cheers,
mark
--
Mark Mielke <mark@mielke.cc>
On Thu, Sep 06, 2007 at 01:08:59PM -0500, Kenneth Marshall wrote:
Since we already have to check the actual tuple values for any index
lookup in postgresql, we could only store the full hash value and the
corresponding TIDs in the bucket. Then when we lookup an item by
calculating its hash, if the exact hash is not present in the bucket,
then we know that the item is not in the index.
Sounds like you'd be returning a bitmap for use with a bitmap scan.
That's a different take on other suggestions I've heard and would allow
a hash index to have an almost unlimited key size yet flexible
matching... (combined with other index, or even just the same index).
Neat.
Have a nice day,
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
From each according to his ability. To each according to his ability to litigate.
On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
2. Evaluate the performance of different hash index implementations
and/or changes to the current implementation. My current plan is
to keep the implementation as simple as possible and still provide
the desired performance. Several hash index suggestions deal with
changing the layout of the keys on a page to improve lookup
performance, including reducing the bucket size to a fraction of
a page or only storing the hash value on the page, instead of
the index value itself.
You might find this patch useful:
http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
It implements the "just store the hash in the index" idea; it also sorts
the entries in a bucket by the hash value, which allows binary search to
be used to locate candidate matches.
I was surprised that this didn't result in a performance improvement for
the benchmarks that I ran, but I never got around to investigating
further (either running more benchmarks or checking whether there was a
bug in the implementation).
Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
it up to HEAD if you'd like.
-Neil
Neil Conway wrote:
You might find this patch useful:
http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
Oh, I had forgot about that.
It implements the "just store the hash in the index" idea; it also sorts
the entries in a bucket by the hash value, which allows binary search to
be used to locate candidate matches.I was surprised that this didn't result in a performance improvement for
the benchmarks that I ran, but I never got around to investigating
further (either running more benchmarks or checking whether there was a
bug in the implementation).
You did get a smaller index, which has cache benefits with larger
tables. You didn't compare the size comparison against b-tree, but a
smaller index is a good thing.
I think you left some money on the table on that front. Since all the
HashItems are fixed size, you can get rid of the line pointers
altogether. Given that a sizeof(HashItemData) is 8 bytes, and a line
pointer is 4 bytes, that should give a further 1/3 reduction in index
size. If you're willing to give up on the alignment of HashItemData, you
can get it down to 6 bytes.
Even from a CPU point of view, as the table becomes bigger and the
b-tree becomes deeper, the difference between O(1) and O(log n) lookup
starts to become more significant.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
2. Evaluate the performance of different hash index implementations
and/or changes to the current implementation. My current plan is
to keep the implementation as simple as possible and still provide
the desired performance. Several hash index suggestions deal with
changing the layout of the keys on a page to improve lookup
performance, including reducing the bucket size to a fraction of
a page or only storing the hash value on the page, instead of
the index value itself.You might find this patch useful:
http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
It implements the "just store the hash in the index" idea; it also sorts
the entries in a bucket by the hash value, which allows binary search to
be used to locate candidate matches.I was surprised that this didn't result in a performance improvement for
the benchmarks that I ran, but I never got around to investigating
further (either running more benchmarks or checking whether there was a
bug in the implementation).Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
it up to HEAD if you'd like.-Neil
This is a great starting point. I would appreciate it if you have the
time and could make it apply cleanly to HEAD. I remember when you first
posted it but had forgotten, probably because of the lack-luster results.
Just a quick glance at the patch and from what I can tell, tagging the
index as lossy causes a lot more work to be done than should be needed
in theory. Currently the index-scan machinery will recheck the value
against the original value for lossy indexes. However, given that we
are using a good hash function with a low chance of collision, if we
mark the unique items in the index then they do not actually have to
be rechecked during the scan. Do you have any suggestions for implementing
that optimization or is there any option to tell the scan machinery to
only re-check a certain list of tuples? Thank you again for pointing
this patch out and please let me know when you have a version against
HEAD.
Regards,
Ken
Show quoted text
On Fri, Sep 07, 2007 at 12:55:37PM +0100, Heikki Linnakangas wrote:
Neil Conway wrote:
You might find this patch useful:
http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
Oh, I had forgot about that.
It implements the "just store the hash in the index" idea; it also sorts
the entries in a bucket by the hash value, which allows binary search to
be used to locate candidate matches.I was surprised that this didn't result in a performance improvement for
the benchmarks that I ran, but I never got around to investigating
further (either running more benchmarks or checking whether there was a
bug in the implementation).You did get a smaller index, which has cache benefits with larger
tables. You didn't compare the size comparison against b-tree, but a
smaller index is a good thing.I think you left some money on the table on that front. Since all the
HashItems are fixed size, you can get rid of the line pointers
altogether. Given that a sizeof(HashItemData) is 8 bytes, and a line
pointer is 4 bytes, that should give a further 1/3 reduction in index
size. If you're willing to give up on the alignment of HashItemData, you
can get it down to 6 bytes.Even from a CPU point of view, as the table becomes bigger and the
b-tree becomes deeper, the difference between O(1) and O(log n) lookup
starts to become more significant.
If you use the size values from my initial tests, the hash index is
down to 3X the btree size instead of 5X. If we can make these additional
changes and add a unique flags to the index entry, we would have a much
smaller index. I had not thought of it at the time, but the addition of
the unique/sole item flag would allow the use of the hash index to
support unique indexes and be used for primary keys. In some usage
cases, the O(1) would be very advantageous.
Regards,
Ken
Kenneth Marshall wrote:
On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
You might find this patch useful:
http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
...Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
it up to HEAD if you'd like.This is a great starting point. I would appreciate it if you have the
time and could make it apply cleanly to HEAD. I remember when you first
posted it but had forgotten, probably because of the lack-luster results.
Just a quick glance at the patch and from what I can tell, tagging the
index as lossy causes a lot more work to be done than should be needed
in theory. Currently the index-scan machinery will recheck the value
against the original value for lossy indexes. However, given that we
are using a good hash function with a low chance of collision, if we
mark the unique items in the index then they do not actually have to
be rechecked during the scan. Do you have any suggestions for implementing
that optimization or is there any option to tell the scan machinery to
only re-check a certain list of tuples? Thank you again for pointing
this patch out and please let me know when you have a version against
HEAD.
What do you mean by "mark the unique items in the index then they do not
actually have to be rechecked during the scan." Even if there is a
unique hash value mapping to a unique key, there is no guarantee that a
new value won't result in that same hash value. Such is the nature of
hashes. A hash key map does not mean a value match. The value must be
checked. The opposite, however, may be true. If the hash key is not
found, then we know the row for the value does not exist.
Have you measured the performance of re-checking? I have always assumed
the performance of re-checking was near free when compared to the cost
of looking up the tuples in the table to determine whether or not they
are "live". This is why I have not been upset that bitmap index scans
often re-check.
Cheers,
mark
--
Mark Mielke <mark@mielke.cc>
On Fri, Sep 07, 2007 at 09:50:07AM -0400, Mark Mielke wrote:
Kenneth Marshall wrote:
On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
You might find this patch useful:
http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
...Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
it up to HEAD if you'd like.This is a great starting point. I would appreciate it if you have the
time and could make it apply cleanly to HEAD. I remember when you first
posted it but had forgotten, probably because of the lack-luster results.
Just a quick glance at the patch and from what I can tell, tagging the
index as lossy causes a lot more work to be done than should be needed
in theory. Currently the index-scan machinery will recheck the value
against the original value for lossy indexes. However, given that we
are using a good hash function with a low chance of collision, if we
mark the unique items in the index then they do not actually have to
be rechecked during the scan. Do you have any suggestions for implementing
that optimization or is there any option to tell the scan machinery to
only re-check a certain list of tuples? Thank you again for pointing
this patch out and please let me know when you have a version against
HEAD.What do you mean by "mark the unique items in the index then they do not
actually have to be rechecked during the scan." Even if there is a unique
hash value mapping to a unique key, there is no guarantee that a new value
won't result in that same hash value. Such is the nature of hashes. A hash
key map does not mean a value match. The value must be checked. The
opposite, however, may be true. If the hash key is not found, then we know
the row for the value does not exist.Have you measured the performance of re-checking? I have always assumed the
performance of re-checking was near free when compared to the cost of
looking up the tuples in the table to determine whether or not they are
"live". This is why I have not been upset that bitmap index scans often
re-check.Cheers,
mark
I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibility.
Regards,
Ken
On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
2. Evaluate the performance of different hash index implementations
and/or changes to the current implementation. My current plan is
to keep the implementation as simple as possible and still provide
the desired performance. Several hash index suggestions deal with
changing the layout of the keys on a page to improve lookup
performance, including reducing the bucket size to a fraction of
a page or only storing the hash value on the page, instead of
the index value itself.You might find this patch useful:
http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
It implements the "just store the hash in the index" idea; it also sorts
the entries in a bucket by the hash value, which allows binary search to
be used to locate candidate matches.I was surprised that this didn't result in a performance improvement for
the benchmarks that I ran, but I never got around to investigating
further (either running more benchmarks or checking whether there was a
bug in the implementation).Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
it up to HEAD if you'd like.-Neil
I have another question. Did the scan code at this time use the
heap-order scanning? Could that have had an impact on the patch
performance?
Ken
Kenneth Marshall wrote:
I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibility
The value comparison cannot be skipped. I do not think you understand
the many-to-one mapping in its entirety.
Example:
Table has: a(1), b(2), c(3)
Index has: 1 => 1 (unique), 2 => 2 (unique), 3 => 3 (unique)
Query:
select * from table where key = 'z';
If 'z' hashes to '3' (completely possible), then the index record 3
points to tuple 3, and it "exists". Only, it doesn't because 'a' <> 'z'.
You MUST check the value.
Cheers,
mark
--
Mark Mielke <mark@mielke.cc>
Kenneth Marshall wrote:
I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibility.
How likely is it that you will get a hash collision, two strings that
are different that will hash to the same value? To avoid this requires
a very large hash key (128 bits, minimum)- otherwise you get into
birthday attack problems. With a 32-bit hash, the likelyhood is greater
than 50% that two strings in a collection of 100,000 will hash to the
same value. With a 64-bit hash, the likelyhood is greater than 50% that
two strings in a collection of 10 billion will has to same value. 10
billion is a large number, but not an unreasonable number, of strings to
want to put into a hash table- and it's exactly this case where the O(1)
cost of hashtables starts being a real win.
Brian
On Fri, Sep 07, 2007 at 10:30:30AM -0400, Mark Mielke wrote:
Kenneth Marshall wrote:
I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibilityThe value comparison cannot be skipped. I do not think you understand the
many-to-one mapping in its entirety.Example:
Table has: a(1), b(2), c(3)
Index has: 1 => 1 (unique), 2 => 2 (unique), 3 => 3 (unique)Query:
select * from table where key = 'z';
If 'z' hashes to '3' (completely possible), then the index record 3 points
to tuple 3, and it "exists". Only, it doesn't because 'a' <> 'z'. You MUST
check the value.Cheers,
mark
Yes, you are completely correct. Thank you for the reminder.
Regards,
Ken
On Fri, Sep 07, 2007 at 10:36:41AM -0400, Brian Hurt wrote:
Kenneth Marshall wrote:
I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibility.How likely is it that you will get a hash collision, two strings that are
different that will hash to the same value? To avoid this requires a very
large hash key (128 bits, minimum)- otherwise you get into birthday attack
problems. With a 32-bit hash, the likelyhood is greater than 50% that two
strings in a collection of 100,000 will hash to the same value. With a
64-bit hash, the likelyhood is greater than 50% that two strings in a
collection of 10 billion will has to same value. 10 billion is a large
number, but not an unreasonable number, of strings to want to put into a
hash table- and it's exactly this case where the O(1) cost of hashtables
starts being a real win.Brian
Yes, there is a non-negligible chance of collision (In a DB is there
any chance that is non-negligible? :) ) and the values must be checked
against the actual. The win is the collapse of the index size and only
needed to check a small fraction of the actual tuples.
Regards,
Ken
Kenneth Marshall wrote:
How likely is it that you will get a hash collision, two strings that are
different that will hash to the same value? To avoid this requires a very
large hash key (128 bits, minimum)- otherwise you get into birthday attack
problems. With a 32-bit hash, the likelyhood is greater than 50% that two
strings in a collection of 100,000 will hash to the same value. With a
64-bit hash, the likelyhood is greater than 50% that two strings in a
collection of 10 billion will has to same value. 10 billion is a large
number, but not an unreasonable number, of strings to want to put into a
hash table- and it's exactly this case where the O(1) cost of hashtables
starts being a real win.Brian
Yes, there is a non-negligible chance of collision (In a DB is there
any chance that is non-negligible? :) ) and the values must be checked
against the actual. The win is the collapse of the index size and only
needed to check a small fraction of the actual tuples.
Ah, OK- I misunderstood you. I thought you were saying that the hash
values would need to be unique, and you wouldn't check the original
values at all. My bad.
Brian
On Fri, Sep 07, 2007 at 11:08:13AM -0400, Brian Hurt wrote:
Kenneth Marshall wrote:
How likely is it that you will get a hash collision, two strings that are
different that will hash to the same value? To avoid this requires a
very large hash key (128 bits, minimum)- otherwise you get into birthday
attack problems. With a 32-bit hash, the likelyhood is greater than 50%
that two strings in a collection of 100,000 will hash to the same value.
With a 64-bit hash, the likelyhood is greater than 50% that two strings
in a collection of 10 billion will has to same value. 10 billion is a
large number, but not an unreasonable number, of strings to want to put
into a hash table- and it's exactly this case where the O(1) cost of
hashtables starts being a real win.Brian
Yes, there is a non-negligible chance of collision (In a DB is there
any chance that is non-negligible? :) ) and the values must be checked
against the actual. The win is the collapse of the index size and only
needed to check a small fraction of the actual tuples.Ah, OK- I misunderstood you. I thought you were saying that the hash
values would need to be unique, and you wouldn't check the original values
at all. My bad.Brian
No, you were correct. I misstated originally and you and Mark both pointed
out my mistake.
Regards,
Ken
On Fri, Sep 07, 2007 at 10:36:41AM -0400, Brian Hurt wrote:
Kenneth Marshall wrote:
I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibility.How likely is it that you will get a hash collision, two strings that are
different that will hash to the same value? To avoid this requires a very
large hash key (128 bits, minimum)- otherwise you get into birthday attack
problems. With a 32-bit hash, the likelyhood is greater than 50% that two
strings in a collection of 100,000 will hash to the same value. With a
64-bit hash, the likelyhood is greater than 50% that two strings in a
collection of 10 billion will has to same value. 10 billion is a large
number, but not an unreasonable number, of strings to want to put into a
hash table- and it's exactly this case where the O(1) cost of hashtables
starts being a real win.Brian
Continuing this train of thought.... While it would make sense for larger
keys to store the hash in the index, if the key is smaller, particularly
if it is of fixed size, it would make sense to store the key in the index
instead. This would have the benefit of allowing use of the hash index
in non-lossy mode albeit with a slight increase in complexity.
Ken
Kenneth Marshall wrote:
Continuing this train of thought.... While it would make sense for larger
keys to store the hash in the index, if the key is smaller, particularly
if it is of fixed size, it would make sense to store the key in the index
instead. This would have the benefit of allowing use of the hash index
in non-lossy mode albeit with a slight increase in complexity.
I suspect there is no value in designing a hash implementation to work
well for a context where a btree index would already perform equally well.
If there are too few hash buckets, performance is not O(1). For a hash
index to function better than btree, I believe focus should be spent on
the O(1) case, which means ensuring that enough hash buckets are used to
provide O(1).
All of these must match: 1) Hash value, 2) Key value, 3) Tuple visibility.
In the optimum O(1) scenario, each existing key will map to a hash
bucket that contains ~1 entry. For this case, there is no value to
having the key stored in the index row, as 3) Tuple visibility, will
still require access to the table row. In this optimum scenario, I do
not believe anything of value is saved by storing the key in the index
row. The loss, however, is that the hash index data structures become
more complex, and would likely require support for variable length data.
The resulting increase in hash index size and code complexity would
reduce performance.
Just an opinion.
Cheers,
mark
--
Mark Mielke <mark@mielke.cc>
On Sat, Sep 08, 2007 at 05:14:09PM -0400, Mark Mielke wrote:
Kenneth Marshall wrote:
Continuing this train of thought.... While it would make sense for larger
keys to store the hash in the index, if the key is smaller, particularly
if it is of fixed size, it would make sense to store the key in the index
instead. This would have the benefit of allowing use of the hash index
in non-lossy mode albeit with a slight increase in complexity.I suspect there is no value in designing a hash implementation to work well
for a context where a btree index would already perform equally well.If there are too few hash buckets, performance is not O(1). For a hash
index to function better than btree, I believe focus should be spent on the
O(1) case, which means ensuring that enough hash buckets are used to
provide O(1).All of these must match: 1) Hash value, 2) Key value, 3) Tuple visibility.
In the optimum O(1) scenario, each existing key will map to a hash bucket
that contains ~1 entry. For this case, there is no value to having the key
stored in the index row, as 3) Tuple visibility, will still require access
to the table row. In this optimum scenario, I do not believe anything of
value is saved by storing the key in the index row. The loss, however, is
that the hash index data structures become more complex, and would likely
require support for variable length data. The resulting increase in hash
index size and code complexity would reduce performance.Just an opinion.
Cheers,
mark
I agree that we should focus on the O(1) area. The value of storing the
actual value, possibly as an option, is that the key check can be done
in the index without requiring a heap lookup to check the actual value
which would be a benefit for a unique index. I agree that supporting
variable length data would complicate the index and reduce performance.
I am not willing to assume that ~1 entry per hash bucket is necessarily
what we want, at least without some testing. It seems reasonable that
with the performance differences between L1/L2/L3 cache, main memory,
and the disk subsystem a higher load factor would result in better
overall system performance by reducing cache line misses and improving
hardware pre-fetch efficiency. Along with the hypothetical performance
wins, the hash index space efficiency would be improved by a similar
factor. Obviously, all of these ideas would need to be tested in
various workload environments. In the large index arena, 10^6 to 10^9
keys and more, space efficiency will help keep the index manageable
in todays system memories.
Please keep the ideas and comments coming. I am certain that a synthesis
of them will provide an implementation with the performance characteristics
that we are seeking.
Regards,
Ken
Show quoted text
--
Mark Mielke <mark@mielke.cc>
Kenneth Marshall wrote:
The value of storing the
actual value, possibly as an option, is that the key check can be done
in the index without requiring a heap lookup to check the actual value
which would be a benefit for a unique index. I agree that supporting
variable length data would complicate the index and reduce performance.
I am not willing to assume that ~1 entry per hash bucket is necessarily
what we want, at least without some testing.
I think that if the case of >1 entry per hash becomes common enough to
be significant, and the key is stored in the hash, that a btree will
perform equal or better, and there is no point in pursuing such a hash
index model. This is where we are today.
It seems reasonable that
with the performance differences between L1/L2/L3 cache, main memory,
and the disk subsystem a higher load factor would result in better
overall system performance by reducing cache line misses and improving
hardware pre-fetch efficiency.
If the key is stored, all of these factors likely favor the btree format
over the hash format. Again, this is where we are today.
Along with the hypothetical performance
wins, the hash index space efficiency would be improved by a similar
factor. Obviously, all of these ideas would need to be tested in
various workload environments. In the large index arena, 10^6 to 10^9
keys and more, space efficiency will help keep the index manageable
in todays system memories.
Space efficiency is provided by not storing the key, nor the header data
required (length prefix?).
Please keep the ideas and comments coming. I am certain that a synthesis
of them will provide an implementation with the performance characteristics
that we are seeking.
The subject interests me. :-)
Cheers,
mark
--
Mark Mielke <mark@mielke.cc>
On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
Kenneth Marshall <ktm@rice.edu> writes:
... This is the rough plan. Does anyone see anything critical that
is missing at this point?Sounds pretty good. Let me brain-dump one item on you: one thing that
hash currently has over btree is the ability to handle index items up
to a full page. Now, if you go with a scheme that only stores hash
codes and not the underlying data, you can not only handle that but
improve on it; but if you reduce the bucket size and don't remove the
data, it'd be a step backward. The idea I had about dealing with that
was to only reduce the size of primary buckets --- if it's necessary to
add overflow space to a bucket, the overflow units are still full pages.
So an index tuple up to a page in size can always be accommodated by
adding an overflow page to the bucket.Just a thought, but AFAIR it's not in the archives anywhere.
regards, tom lane
I was thinking about this some more, and it strikes me that we can
keep the page size = bucket size = overflow size in the new scheme
of storing the hash value in the index and still reduce the effective
bucket size. Let's make the new size the (page size / 2^n) where n
is chosen appropriately. Then we we store the value in the page we
simply use n bits of the hash to determine which sub-piece of the
page to use to actually store the value. We may need to add a multiplier
to adjust the decision to split the page based on the mini-page. This
should allow us to much more densely pack the pages and approach the
1 item per bucket. This could easily shrink the size of the index by
a factor of 2^n. Let me know what you think.
Regards,
Ken
--On Samstag, September 08, 2007 18:56:23 -0400 Mark Mielke
<mark@mark.mielke.cc> wrote:
Kenneth Marshall wrote:
Along with the hypothetical performance
wins, the hash index space efficiency would be improved by a similar
factor. Obviously, all of these ideas would need to be tested in
various workload environments. In the large index arena, 10^6 to 10^9
keys and more, space efficiency will help keep the index manageable
in todays system memories.Space efficiency is provided by not storing the key, nor the header data
required (length prefix?).
Space efficiency at ~1 entry per bucket: How about using closed hashing,
saving in each page a bitmask in front which specifies which entries hold
valid entries and in the rest of the page row-pointers (is this the correct
expression? I don't know...) without further data. Should provide
reasonably simple data structure and alignment for the pointers.
Please keep the ideas and comments coming. I am certain that a synthesis
of them will provide an implementation with the performance
characteristics
that we are seeking.
One should look into new plan nodes for "!= ANY()", "NOT EXISTS" and
similar. A node like "look into hash and true if bucket is empty" would
work without checking tuple visibility when the bucket is empty and could
be a win in some situations.
Do we want special cases for short keys like INT4? In those cases the
implementation might use hash == key and put that knowledge to use in
plans. Even a unique constraint might then be doable. Does the
postgresql-storage backend on linux support sparse files? Might be a win
when holes in the sequence turn up.
More random thoughts:
- Hash-Indices are best for unique keys, but every table needs a new hash
key, which means one more random page access. Is there any way to build
multi-_table_ indices? A join might then fetch all table rows with a given
unique key after one page fetch for the combined index.
- Hashes with trivial hash-functions (like identity) can also return rows
in a desired order.
- Is there a case where a sequentially scanning a hash-index is useful? I
can't find any, but maybe somebody else has a use-case.
- What about HashJoins when the referenced tables have hash-indices?
- What about hash-indices where entries are inserted for multiple columns.
Assume a table like this:
CREATE TABLE link (obj_id1 INT4, obj_id2 INT4);
and a query like
SELECT * FROM link WHERE ? IN (obj_id1, obj_id2);
or some join using a similar condition. It might be a nice thing to insert
entries at both HASH(obj_id1) and HASH(obj_id2), which would eliminate the
need to check in two indices and do a bitmap OR. OTOH it might not be
faster in any significant use cases because who'd need a link table with
nearly unique linked objects?
- In cases where the distribution of the hash-function is good, but a small
and relatively even number of rows exist for each key (like it might be the
case in the above example), it might be nice to reserve a given amount of
same-key row entries in each bucket, and hold a fill-count at the front of
it. That would avoid costly page fetches after each collision. You'd create
a hash-index with n-buckets, each m-elements large. When the bucket is
full, the usual collision handling continues.
- About hash enlargement: What about always using only the first k bits of
each hash value. When you find that the hash is "quite-full" (however that
is defined and detected), k is increased by one, effectively doubling the
hash size. New entries are then written as usual, while retrieving the old
entries needs to test at the k-bit-position first and if there is a miss,
also at the k-1-position and so forth. To limit this search, some
background process could after analyzing the index move old entries to the
now correct k-bit-position and increment some "min-k"-value once all old
entries have been moved. After the hash has been increased, the index would
approximately half it's speed for some time then. Additionally one could
also insert the entry at the new position if it has been found at the old
one only while using the index. A special "miss"-entry at the new position
doesn't help if nothing could be found because the old positions will
usually hold some data which resides there even if it uses k bits.
Import Notes
Resolved by subject fallback
Kenneth Marshall wrote:
On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
Kenneth Marshall <ktm@rice.edu> writes:
... This is the rough plan. Does anyone see anything critical that
is missing at this point?Sounds pretty good. Let me brain-dump one item on you: one thing that
hash currently has over btree is the ability to handle index items up
to a full page. Now, if you go with a scheme that only stores hash
codes and not the underlying data, you can not only handle that but
improve on it; but if you reduce the bucket size and don't remove the
data, it'd be a step backward. The idea I had about dealing with that
was to only reduce the size of primary buckets --- if it's necessary to
add overflow space to a bucket, the overflow units are still full pages.
So an index tuple up to a page in size can always be accommodated by
adding an overflow page to the bucket.Just a thought, but AFAIR it's not in the archives anywhere.
regards, tom lane
I was thinking about this some more, and it strikes me that we can
keep the page size = bucket size = overflow size in the new scheme
of storing the hash value in the index and still reduce the effective
bucket size. Let's make the new size the (page size / 2^n) where n
is chosen appropriately. Then we we store the value in the page we
simply use n bits of the hash to determine which sub-piece of the
page to use to actually store the value. We may need to add a multiplier
to adjust the decision to split the page based on the mini-page. This
should allow us to much more densely pack the pages and approach the
1 item per bucket. This could easily shrink the size of the index by
a factor of 2^n. Let me know what you think.
My personal opinion is that something like this is required to take best
advantage of hashes. I didn't respond immediately because I don't have
advice on what the best implementation would look like.
I have also been wondering about the effect of a hash index that
includes multiple values to the same key (either a non-unique key, or
different tuples from the same key due to table updates). I started by
thinking that the maximum number of hash entries per hash bucket should
be kept to 2 or 3 to prevent reduction in performance to that of btree,
but I think this doesn't work if multiple tuples can have the same key.
Unless - the maps is hash value =1:1> index page =1:1> hash bucket =1:N>
hash value =1:M=> tuples. Guarantee than N is small (either <= 2 or <=4
depending on performance evaluation) by re-hashing if N ever becomes > 2
or > 4. Choose a sparse harsh. Let 1:M be indefinite? Also, optimize the
1:M=1:1 case, such that the value can be inline?
For most cases, I would think the above model would make it cheap to
determine if the key does not exist, as well as the 1:1 (hash value:key)
case requiring a single page lookup. Update performance would suffer,
but lookup should be faster than btree in these cases, as btree often
requires > 1 index page scan.
Cheers,
mark
--
Mark Mielke <mark@mielke.cc>
On Sat, Sep 08, 2007 at 06:56:23PM -0400, Mark Mielke wrote:
I think that if the case of >1 entry per hash becomes common enough to
be significant, and the key is stored in the hash, that a btree will
perform equal or better, and there is no point in pursuing such a hash
index model. This is where we are today.
There is the point that if a user does an UPDATE of a row without
changing the key your index will have to store entries with the same
hash. If your goal is mostly write-once tables, then that's cool, but
otherwise you're going to have to find a way of dealing with that. It's
not just going to happen a lot, it is going to be common, even for
unique indexes.
Presumably HOT will help with this, but that relies on all index columns
not to change.
The major benenfits will mostly come from not storing the key at all. I
think.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
From each according to his ability. To each according to his ability to litigate.
We are pleased to announce an upcoming patch to the hash index code
which improves build time and index size, based on this item in the
TODO list:
During index creation, pre-sort the tuples to improve build speed
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php
Using our implementation, build times and index sizes are
comparable with btree index build times and index sizes.
For example, for a particular 12 million row relation, the
8.2.4 release requires over 2.8 hours to build the hash index.
With our patch, the index is built in 80 seconds.
Here is the link for a graph, showing a comparative analysis of
btree and hash index builds and describing the benchmark data.
http://web.cecs.pdx.edu/~raneyt/pg/
We are currently cleaning up the patch and will submit it asap.
Regards,
Shreya Bhargava <shreya_bhargav@yahoo.com>
Tom Raney <twraney@comcast.net>
Kenneth Marshall wrote:
Show quoted text
Dear PostgreSQL Hackers:
After following the hackers mailing list for quite a while,
I am going to start investigating what will need to be done
to improve hash index performance. Below are the pieces of
this project that I am currently considering:1. Characterize the current hash index implementation against
the BTree index, with a focus on space utilization and
lookup performance against a collection of test data. This
will give a baseline performance test to evaluate the impact
of changes. I initially do not plan to bench the hash creation
process since my initial focus will be on lookup performance.2. Evaluate the performance of different hash index implementations
and/or changes to the current implementation. My current plan is
to keep the implementation as simple as possible and still provide
the desired performance. Several hash index suggestions deal with
changing the layout of the keys on a page to improve lookup
performance, including reducing the bucket size to a fraction of
a page or only storing the hash value on the page, instead of
the index value itself. My goal in this phase is to produce one
or more versions with better performance than the current BTree.3. Look at build time and concurrency issues with the addition of
some additional tests to the test bed. (1)4. Repeat as needed.
This is the rough plan. Does anyone see anything critical that
is missing at this point? Please send me any suggestions for test
data and various performance test ideas, since I will be working
on that first.Regards,
Ken Marshall---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
We are pleased to announce an upcoming patch to the hash index code
which improves build time and index size, based on this item in the
TODO list:
During index creation, pre-sort the tuples to improve build speed
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.phpUsing our implementation, build times and index sizes are
comparable with btree index build times and index sizes.
For example, for a particular 12 million row relation, the
8.2.4 release requires over 2.8 hours to build the hash index. With our
patch, the index is built in 80 seconds.
Here is the link for a graph, showing a comparative analysis of
btree and hash index builds and describing the benchmark data.
http://web.cecs.pdx.edu/~raneyt/pg/We are currently cleaning up the patch and will submit it asap.
Regards,
Shreya Bhargava <shreya_bhargav@yahoo.com>
Tom Raney <twraney@comcast.net>
That is super! (and timely)
I was just looking at that some myself since the large index build
times make testing a very laborious process. I am looking forward to
seeing the details.
Regards,
Ken
"Kenneth Marshall" <ktm@rice.edu> writes:
On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
Using our implementation, build times and index sizes are
comparable with btree index build times and index sizes....
That is super! (and timely)
It is pretty super. I have a few comments to raise but don't take it to be too
negative, it sounds like this is a big step forward towards making hash
indexes valuable.
Firstly in the graphs it seems the index size graph has an exponential x-axis
but a linear y-axis. This makes it hard to see what I would expect to be
pretty much linear growth. The curves look exponential which would mean linear
growth but of course it's hard to tell.
Also, the growth in the time chart looks pretty much linear. That seems weird
since I would expect there would be a visible extra component since sort times
are n-log(n). Perhaps you need to test still larger tables to see that though.
In any case it's clear from the results you have there that the change is a
positive one and fixes a fundamental problem with the hash index build code.
Something else you should perhaps test is indexing something which is
substantially larger than hash function output. A hash function is going to
make the most sense when indexing something like strings for which you want to
avoid the long comparison costs. Especially consider testing this on a UTF8
locale with expensive comparisons (like a CJK locale for example).
Note that the bottom line for the problems with hash indexes is that the
current implementation doesn't offer any advantages over btree indexes. Hash
indexes need to be not only as good of a choice as btree indexes but
significantly better a significantly better choice at least for some specific
circumstances.
Also, if you're going to submit a patch you should check out a copy of the CVS
HEAD and work from that. I don't think there are huge differences in the area
of hash indexes though. But in most other areas you would be spending quite a
bit of time dealing details which have changed since.
Finally note that we're in the final throes of the 8.3 feature freeze.
Normally any patch submitted now would be held until 8.3 is released and
development on 8.4 is started. I could imagine an exception being made for
hash indexes since they're so moribund currently but probably not. The flip
side of that argument is that there's not much point in making an exception
for something which will only be really useful once further work is done in
the same area.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Tue, Sep 25, 2007 at 03:35:47PM +0100, Gregory Stark wrote:
"Kenneth Marshall" <ktm@rice.edu> writes:
On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
Using our implementation, build times and index sizes are
comparable with btree index build times and index sizes....
That is super! (and timely)It is pretty super. I have a few comments to raise but don't take it to be too
negative, it sounds like this is a big step forward towards making hash
indexes valuable.Firstly in the graphs it seems the index size graph has an exponential x-axis
but a linear y-axis. This makes it hard to see what I would expect to be
pretty much linear growth. The curves look exponential which would mean linear
growth but of course it's hard to tell.Also, the growth in the time chart looks pretty much linear. That seems weird
since I would expect there would be a visible extra component since sort times
are n-log(n). Perhaps you need to test still larger tables to see that though.In any case it's clear from the results you have there that the change is a
positive one and fixes a fundamental problem with the hash index build code.Something else you should perhaps test is indexing something which is
substantially larger than hash function output. A hash function is going to
make the most sense when indexing something like strings for which you want to
avoid the long comparison costs. Especially consider testing this on a UTF8
locale with expensive comparisons (like a CJK locale for example).Note that the bottom line for the problems with hash indexes is that the
current implementation doesn't offer any advantages over btree indexes. Hash
indexes need to be not only as good of a choice as btree indexes but
significantly better a significantly better choice at least for some specific
circumstances.Also, if you're going to submit a patch you should check out a copy of the CVS
HEAD and work from that. I don't think there are huge differences in the area
of hash indexes though. But in most other areas you would be spending quite a
bit of time dealing details which have changed since.Finally note that we're in the final throes of the 8.3 feature freeze.
Normally any patch submitted now would be held until 8.3 is released and
development on 8.4 is started. I could imagine an exception being made for
hash indexes since they're so moribund currently but probably not. The flip
side of that argument is that there's not much point in making an exception
for something which will only be really useful once further work is done in
the same area.
Although I am very excited about this patch, I do not see any real value
in including it in 8.3. As you mention, we need to to have a hash index
implementation that outperforms btree in some problem regime and that is
currently not the case. I have just recently started the process of
gathering ideas and having discussions on various approaches to making
hash indexes more performant and we have a number of ideas on which to
start our investigation. I do think that this patch will make the testing
and evaluation, that will be needed to truly improve the hash index, much
much easier.
Regards,
Ken
On Sep 25, 2007, at 11:26 , Kenneth Marshall wrote:
Although I am very excited about this patch, I do not see any real
value
in including it in 8.3.
I don't think you have to worry about it being in 8.3. Feature freeze
was months ago.
Michael Glaesemann
grzm seespotcode net
Kenneth Marshall wrote:
On Tue, Sep 25, 2007 at 03:35:47PM +0100, Gregory Stark wrote:
"Kenneth Marshall" <ktm@rice.edu> writes:
On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
Using our implementation, build times and index sizes are
comparable with btree index build times and index sizes....
That is super! (and timely)It is pretty super. I have a few comments to raise but don't take it to be too
negative, it sounds like this is a big step forward towards making hash
indexes valuable.Firstly in the graphs it seems the index size graph has an exponential x-axis
but a linear y-axis. This makes it hard to see what I would expect to be
pretty much linear growth. The curves look exponential which would mean linear
growth but of course it's hard to tell.Also, the growth in the time chart looks pretty much linear. That seems weird
since I would expect there would be a visible extra component since sort times
are n-log(n). Perhaps you need to test still larger tables to see that though.In any case it's clear from the results you have there that the change is a
positive one and fixes a fundamental problem with the hash index build code.Something else you should perhaps test is indexing something which is
substantially larger than hash function output. A hash function is going to
make the most sense when indexing something like strings for which you want to
avoid the long comparison costs. Especially consider testing this on a UTF8
locale with expensive comparisons (like a CJK locale for example).Note that the bottom line for the problems with hash indexes is that the
current implementation doesn't offer any advantages over btree indexes. Hash
indexes need to be not only as good of a choice as btree indexes but
significantly better a significantly better choice at least for some specific
circumstances.Also, if you're going to submit a patch you should check out a copy of the CVS
HEAD and work from that. I don't think there are huge differences in the area
of hash indexes though. But in most other areas you would be spending quite a
bit of time dealing details which have changed since.Finally note that we're in the final throes of the 8.3 feature freeze.
Normally any patch submitted now would be held until 8.3 is released and
development on 8.4 is started. I could imagine an exception being made for
hash indexes since they're so moribund currently but probably not. The flip
side of that argument is that there's not much point in making an exception
for something which will only be really useful once further work is done in
the same area.Although I am very excited about this patch, I do not see any real value
in including it in 8.3. As you mention, we need to to have a hash index
implementation that outperforms btree in some problem regime and that is
currently not the case. I have just recently started the process of
gathering ideas and having discussions on various approaches to making
hash indexes more performant and we have a number of ideas on which to
start our investigation. I do think that this patch will make the testing
and evaluation, that will be needed to truly improve the hash index, much
much easier.Regards,
Ken
We're glad to contribute and be a part of Postgres. The patch has been
posted to pgsql-patches@postgresql.org.
Speeding up the creation time of hash indexes on non-trivial relations
was our goal. This will allow some interesting performance tests of the
hash index on very large relations. It may be that the near constant
lookup time of the hash index outperforms the Btree index for some large
data sets and for certain types of data and distributions.
Sincerely,
Tom Raney
This has been saved for the 8.4 release:
http://momjian.postgresql.org/cgi-bin/pgpatches_hold
---------------------------------------------------------------------------
Tom Raney wrote:
We are pleased to announce an upcoming patch to the hash index code
which improves build time and index size, based on this item in the
TODO list:
During index creation, pre-sort the tuples to improve build speed
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.phpUsing our implementation, build times and index sizes are
comparable with btree index build times and index sizes.
For example, for a particular 12 million row relation, the
8.2.4 release requires over 2.8 hours to build the hash index.
With our patch, the index is built in 80 seconds.
Here is the link for a graph, showing a comparative analysis of
btree and hash index builds and describing the benchmark data.
http://web.cecs.pdx.edu/~raneyt/pg/We are currently cleaning up the patch and will submit it asap.
Regards,
Shreya Bhargava <shreya_bhargav@yahoo.com>
Tom Raney <twraney@comcast.net>Kenneth Marshall wrote:
Dear PostgreSQL Hackers:
After following the hackers mailing list for quite a while,
I am going to start investigating what will need to be done
to improve hash index performance. Below are the pieces of
this project that I am currently considering:1. Characterize the current hash index implementation against
the BTree index, with a focus on space utilization and
lookup performance against a collection of test data. This
will give a baseline performance test to evaluate the impact
of changes. I initially do not plan to bench the hash creation
process since my initial focus will be on lookup performance.2. Evaluate the performance of different hash index implementations
and/or changes to the current implementation. My current plan is
to keep the implementation as simple as possible and still provide
the desired performance. Several hash index suggestions deal with
changing the layout of the keys on a page to improve lookup
performance, including reducing the bucket size to a fraction of
a page or only storing the hash value on the page, instead of
the index value itself. My goal in this phase is to produce one
or more versions with better performance than the current BTree.3. Look at build time and concurrency issues with the addition of
some additional tests to the test bed. (1)4. Repeat as needed.
This is the rough plan. Does anyone see anything critical that
is missing at this point? Please send me any suggestions for test
data and various performance test ideas, since I will be working
on that first.Regards,
Ken Marshall---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Wed, Sep 05, 2007 at 03:07:03PM -0500, Kenneth Marshall wrote:
On Sun, Sep 02, 2007 at 01:04:04PM -0500, Kenneth Marshall wrote:
Dear PostgreSQL Hackers:
After following the hackers mailing list for quite a while,
I am going to start investigating what will need to be done
to improve hash index performance. Below are the pieces of
this project that I am currently considering:1. Characterize the current hash index implementation against
the BTree index, with a focus on space utilization and
lookup performance against a collection of test data. This
will give a baseline performance test to evaluate the impact
of changes. I initially do not plan to bench the hash creation
process since my initial focus will be on lookup performance.Here are very basic results for a table with 1.6m entries:
postgres=# CREATE TABLE dict (word varchar(100));
CREATE TABLEpostgres=# COPY dict FROM '/tmp/words';
COPY 1648379
postgres=# select count(*) from dict;
count
---------
1648379
(1 row)Time: 11187.418 ms
postgres=# select count(*) from dict;
count
---------
1648379
(1 row)Time: 6040.912 ms
postgres=# CREATE INDEX wordhash ON dict USING hash (word);
CREATE INDEX
Time: 11108707.160 mspostgres=# select * from dict where word = 'avatar';
word
--------
avatar
(1 row)Time: 79.823 ms
postgres=# select * from dict where word = 'zebra';
word
-------
zebra
(1 row)Time: 9.864 ms
postgres=# select * from dict where word = 'turkey';
word
--------
turkey
(1 row)Time: 18.418 ms
Time: 1.045 ms
Time: 1.257 ms
Time: 1.080 mspostgres=# CREATE INDEX wordbtree ON dict USING btree (word);
CREATE INDEXTime: 25438.884 ms
postgres=# select * from dict where word = 'avatar';
word
--------
avatar
(1 row)Time: 13.400 ms
postgres=# select * from dict where word = 'zebra';
word
-------
zebra
(1 row)Time: 1.173 ms
postgres=# select * from dict where word = 'turkey';
word
--------
turkey
(1 row)Time: 1.186 ms
Time: 1.103 ms
Time: 1.099 ms
Time: 1.108 ms------------------------------
Size of table = 87556096Size of hash index = 268451840
Size of btree index = 53510144From my very small sample on an unloaded machine, a hash index lookup
took the least amount of time. It had a much larger initial time which
could be attributable to cache population effects. The size is 5X that
of the Btree index. I will continue to improve the test suite as more
granularity is needed. If anyone has a good data generator, please let
me know. Otherwise I will just roll my own.Regards,
Ken
I have just re-ran the previous benchmark against 8.3beta1 (versus 8.2)
including the hash index sorted build patch and the results are below:
test=# CREATE TABLE dict (word varchar(100));
CREATE TABLE
Time: 24.463 ms
test=# COPY dict FROM '/tmp/cracklib-words';
LOG: checkpoints are occurring too frequently (21 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
COPY 1648379
Time: 44470.235 ms
test=# select count(*) from dict;
count
---------
1648379
(1 row)
Time: 4553.924 ms
test=# CREATE INDEX wordhash ON dict USING hash (word);
CREATE INDEX
Time: 85905.960 ms
test=# select * from dict where word = 'avatar';
word
--------
avatar
(1 row)
Time: 592.906 ms
test=# select * from dict where word = 'zebra';
word
-------
zebra
(1 row)
Time: 21.458 ms
test=# select * from dict where word = 'turkey';
word
--------
turkey
(1 row)
Time: 22.532 ms
Time: 1.224 ms
Time: 1.200 ms
Time: 1.264 ms
test=# CREATE INDEX wordbtree ON dict USING btree (word);
CREATE INDEX
Time: 33714.874 ms
test=# select * from dict where word = 'avatar';
word
--------
avatar
(1 row)
Time: 24.296 ms
test=# select * from dict where word = 'zebra';
word
-------
zebra
(1 row)
Time: 1.400 ms
test=# select * from dict where word = 'turkey';
word
--------
turkey
(1 row)
Time: 28.302 ms
Time: 1.284 ms
Time: 1.313 ms
---------------------------------------
Size of table = 69877760
Size of hash index = 268451840
Size of btree index = 48570368
I just wanted to have this baseline for future patches since
8.3 is just around the bend. As expected, the sorted build
process is a huge improvement from the unsorted original.
Regards,
Ken Marshall
Tom,
That is great. I am looking forward to your patch. After the
issues that you needed to address, I think that it would be
reasonable to add a few more user settings for the hash index.
Fill-factor is too course a knob. The others that I have been
considering are:
maxtuples - Not really the maximum, but a target value to use
for setting up the initial buckets. This would allow you to
set it for data loads and avoid the "split-n-copy" trauma
that you are trying to avoid with your new hash build process.
multiplicity - Try to capture use cases that would require many
overflow pages. In particular, if we discard the normal index
page layout we can skip the space overhead of the page pointer
and generate a more compact index. Then you could use a few
more hash bits to lookup the index entry in the page. How many
bits would be determined by this factor. 8-bits would give
you 256 sub-pieces that could each hold about 3 entries using
the current 4-byte hash, or 2 entries using an 8-byte hash.
What do you think?
Cheers,
Ken
Show quoted text
On Wed, Oct 17, 2007 at 03:31:58PM -0700, Tom Raney wrote:
Kenneth,
Great!
Yes, we did update the code to use the estimate. I will post the patch
with this update. We only saw a very small difference in index build time,
but you may when you add many columns to the base relation.
With 1 billion tuples, you should start to see the hash index outperform
the btree for some equality probes, I would imagine. With a 90% fill
factor, the btree would require 4 levels to index that many tuples. If the
top two were in memory, there would be 3 IOs needed. I don't think PG
supports index only scans, so it will take the extra IO to probe the base
relation. The hash may take up to 2 IOs and maybe even less (or maybe more
depending on how many overflow buckets there are). It might be interesting
to fiddle with the fill factors of each index - hash pages (buckets)
default to 75% full.
-TomTom,
I am getting ready to stitch in an updated, simplified version
of Neil Conway's hash-only hash index patch. Did you have a
chance to update your sizing function to use the planner-like
estimate and not a full table scan? I would like to be able
to use that when my test table start to have 10^9 entries.
If you have not had a chance, I will try and add it myself.Regards,
Ken
Import Notes
Reply to msg id not found: 47168D5E.2010507@comcast.netReference msg id not found: 20071017214714.GB12467@it.is.rice.eduReference msg id not found: 47168D5E.2010507@comcast.net | Resolved by subject fallback
Kenneth, I just pushed the revised patch (v2!). The revised approach
samples the parent relation to estimate the number of tuples rather than
performing a complete scan. In my tests, the estimate appears to be
accurate, erring on the larger side, which is fine.
Tom,
That is great. I am looking forward to your patch. After the
issues that you needed to address, I think that it would be
reasonable to add a few more user settings for the hash index.
Fill-factor is too course a knob. The others that I have been
considering are:
maxtuples - Not really the maximum, but a target value to use
for setting up the initial buckets. This would allow you to
set it for data loads and avoid the "split-n-copy" trauma
that you are trying to avoid with your new hash build process.
If I understand you correctly, I believe we already do this with our
current build process, there should not be any splits of the index if we
estimated the tuple count correctly. However, what gets you is
collisions where lots of overflow pages occur when distinct keys map to
the same bucket, or if you have lots of duplicate keys. Because your
overall tuple count doesn't exceed the fill factor, no splits occur, but
lengthy bucket chains lead to lots of IOs. You touch on this below.
multiplicity - Try to capture use cases that would require many
overflow pages. In particular, if we discard the normal index
page layout we can skip the space overhead of the page pointer
and generate a more compact index. Then you could use a few
more hash bits to lookup the index entry in the page. How many
bits would be determined by this factor. 8-bits would give
you 256 sub-pieces that could each hold about 3 entries using
the current 4-byte hash, or 2 entries using an 8-byte hash.What do you think?
Yes, this is a good direction. If we can increase the number of buckets
and reduce the bucket size (either physically or virtually) to allow
more direct access without creating a huge index on disk, that would be
ideal. But, then if you do have collisions, overflows occur more
frequently. I spoke with Neil Conway yesterday at the PG conference
here in Portland and he piqued my interest in examining his hash code
more closely to see what he has already done in this area.
-Tom
Show quoted text
Cheers,
KenOn Wed, Oct 17, 2007 at 03:31:58PM -0700, Tom Raney wrote:
Kenneth,
Great!
Yes, we did update the code to use the estimate. I will post the patch
with this update. We only saw a very small difference in index build time,
but you may when you add many columns to the base relation.
With 1 billion tuples, you should start to see the hash index outperform
the btree for some equality probes, I would imagine. With a 90% fill
factor, the btree would require 4 levels to index that many tuples. If the
top two were in memory, there would be 3 IOs needed. I don't think PG
supports index only scans, so it will take the extra IO to probe the base
relation. The hash may take up to 2 IOs and maybe even less (or maybe more
depending on how many overflow buckets there are). It might be interesting
to fiddle with the fill factors of each index - hash pages (buckets)
default to 75% full.
-TomTom,
I am getting ready to stitch in an updated, simplified version
of Neil Conway's hash-only hash index patch. Did you have a
chance to update your sizing function to use the planner-like
estimate and not a full table scan? I would like to be able
to use that when my test table start to have 10^9 entries.
If you have not had a chance, I will try and add it myself.Regards,
Ken---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
Tom,
Thank you for the update. I am currently working on updating the
patch Neil Conway sent in against 8.0-ish that stores only the hash
in the index and locates the entries within the page using a binary
search. Then I will fold in your recent update.
On Sun, Oct 21, 2007 at 01:13:48PM -0700, Tom Raney wrote:
Kenneth, I just pushed the revised patch (v2!). The revised approach
samples the parent relation to estimate the number of tuples rather than
performing a complete scan. In my tests, the estimate appears to be
accurate, erring on the larger side, which is fine.
Yes, larger is great and what we need to avoid all the index tuple
suffling between pages.
Tom,
That is great. I am looking forward to your patch. After the
issues that you needed to address, I think that it would be
reasonable to add a few more user settings for the hash index.
Fill-factor is too course a knob. The others that I have been
considering are:maxtuples - Not really the maximum, but a target value to use
for setting up the initial buckets. This would allow you to
set it for data loads and avoid the "split-n-copy" trauma
that you are trying to avoid with your new hash build process.If I understand you correctly, I believe we already do this with our
current build process, there should not be any splits of the index if we
estimated the tuple count correctly. However, what gets you is collisions
where lots of overflow pages occur when distinct keys map to the same
bucket, or if you have lots of duplicate keys. Because your overall tuple
count doesn't exceed the fill factor, no splits occur, but lengthy bucket
chains lead to lots of IOs. You touch on this below.
Yes, you do address this in your patches and it works well for an
existing heap. My idea was to minimize the shuffling problem when we
are doing a data load and do not have a heap to get a count from
because it has not been loaded yet.
multiplicity - Try to capture use cases that would require many
overflow pages. In particular, if we discard the normal index
page layout we can skip the space overhead of the page pointer
and generate a more compact index. Then you could use a few
more hash bits to lookup the index entry in the page. How many
bits would be determined by this factor. 8-bits would give
you 256 sub-pieces that could each hold about 3 entries using
the current 4-byte hash, or 2 entries using an 8-byte hash.What do you think?
Yes, this is a good direction. If we can increase the number of buckets
and reduce the bucket size (either physically or virtually) to allow more
direct access without creating a huge index on disk, that would be ideal.
But, then if you do have collisions, overflows occur more frequently. I
spoke with Neil Conway yesterday at the PG conference here in Portland and
he piqued my interest in examining his hash code more closely to see what
he has already done in this area.
Right, overflows would occur more frequently and any overflow would
allocate a full page. It may be possible to estimate the multiplicity
and minimize the use of overflow pages. If we know that on average that
there are no more than 10 items in a bucket, we can size the virtual
buckets on the first page to support 10 items and minimize the rollover
to an overflow page.
Other ideas, once we hit the overflow page, back-off to the current
fullpage use to maximize the fill-factor, possibly using Neil's
binary search to speed lookups within the overflow pages. Also, with
the smaller virtual buckets use the remaining space on the page as
the first overflow page to hopefully prevent needing to allocate a
full new overflow page. This would be most useful for the smaller
virtual bucket sizes and improve the overall packing efficiency of
the index. If we intend to take advantage of the O(1) behavior, it
will be most useful for large numbers of tuples, more than 10^9. A
32-bit hash is not good enough and we will need to use a 64-bit hash
to reduce collisions in the hash table and consequent need for overflow
pages. I already have the hash function updated to support 64-bit and
will look at getting that functional once I have the hash-only version
working with the 32-bit hashes. Also, without the need to store the
value in the index, we can boost the size of indexable fields to the
page size, and larger. I was tentatively targeting 32k as the implementation
max, because then hashing time becomes the issue. Of course, all of these
changes and options will need to be benchmarked and discussed.
Thank you again for posting the updated patch.
Regards,
Ken
Show quoted text
-Tom
Cheers,
KenOn Wed, Oct 17, 2007 at 03:31:58PM -0700, Tom Raney wrote:
Kenneth,
Great!
Yes, we did update the code to use the estimate. I will post the patch
with this update. We only saw a very small difference in index build
time, but you may when you add many columns to the base relation. With 1
billion tuples, you should start to see the hash index outperform the
btree for some equality probes, I would imagine. With a 90% fill factor,
the btree would require 4 levels to index that many tuples. If the top
two were in memory, there would be 3 IOs needed. I don't think PG
supports index only scans, so it will take the extra IO to probe the base
relation. The hash may take up to 2 IOs and maybe even less (or maybe
more depending on how many overflow buckets there are). It might be
interesting to fiddle with the fill factors of each index - hash pages
(buckets) default to 75% full. -TomTom,
I am getting ready to stitch in an updated, simplified version
of Neil Conway's hash-only hash index patch. Did you have a
chance to update your sizing function to use the planner-like
estimate and not a full table scan? I would like to be able
to use that when my test table start to have 10^9 entries.
If you have not had a chance, I will try and add it myself.Regards,
Ken---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
On Thu, Sep 13, 2007 at 02:02:14PM -0700, Neil Conway wrote:
On Fri, 2007-09-07 at 08:29 -0500, Kenneth Marshall wrote:
This is a great starting point. I would appreciate it if you have the
time and could make it apply cleanly to HEAD.Just to give you an update on this, I'll try to find the time to get it
done soon, but my day job is keeping me really busy these days, so I'm
not sure when I'll be able to get to it...-Neil
Neil,
I am currently working on integrating your previous patch with the
8.3beta1 hash code + the sorted build patch by Tom Raney. I have
been adding back pieces that were removed and I had a question
about index tuples, in particular what does the size field indicate?
Is it the size in bytes of the data after the IndexTupleData or the
total size in bytes including the IndexTupleData?
In order to use the hitem with the sorted build process, I think
that the flags should provide the size info so that the tuplesort
code can be used. This will also allow the use of >32-bit hash
values. Am I completely off base?
Regards,
Ken
Import Notes
Reply to msg id not found: 1189717334.20524.0.camel@dell.linuxdev.us.dell.com
"Note that the bottom line for the problems with hash indexes is that the
current implementation doesn't offer any advantages over btree indexes. Hash
indexes need to be not only as good of a choice as btree indexes but
significantly better a significantly better choice at least for some specific
circumstances."
We performed some probe tests on our patch on
hash index and the original btree index to compare the
performance between the two. We used a 80 million tuple table.
The table contained single integer attribute and the data
range was 0 - 100000000. (generated by a random generator).
We did the following:
1. Populate the table with 80 million tuples.
2. Create HASH index on the table.
3. clear both linux cache & psql buffers.
(exiting psql and restarting it cleared the psql buffers;
to clear linux cache, we used drop_cache command)
4. start psql
5. select on an integer in the range of values in the table.
(all test numbers were big ones, like 98934599)
6. record the time.
7. exit psql.
8. drop caches.(as described above)
9. repeat 4-8 for different numbers.
10. Drop Hash index.
11. Create Btree index and repeat 3-9.
The results are as shown below. All trials are in ms.
Count is the number of such tuples in the table.
HASH INDEX:
Number Count Trial1 Trial2 Trial3
21367898 2 152.0 129.5 131.1
95678699 2 140.6 145.6 137.5
99899799 2 148.0 147.4 152.6
97677899 2 142.0 153.5 112.0
94311123 2 137.6 146.3 141.3
67544455 2 141.6 144.6 152.7
88877677 2 122.1 123.1 130.8
88788988 2 150.2 150.0 171.7
44333344 1 119.9 116.3 117.6
34267878 1 97.5 99.9 114.8
55489781 2 115.7 117.2 145.3
97676767 1 169.5 168.5 181.7
99767564 1 142.7 133.6 132.7
21367989 4 198.0 193.2 186.7
BTREE INDEX
Number Trial1 Trial2 Trial3
21367898 145.5 145.6 150.6
95678699 177.5 170.0 176.4
99899799 175.4 181.2 185.4
97677899 136.9 149.0 130.8
94311123 179.0 175.3 166.3
67544455 161.7 162.2 170.4
88877677 138.7 135.2 149.9
88788988 165.7 179.3 186.3
44333344 166.0 172.8 184.3
34267878 159.1 168.8 147.8
55489781 183.2 195.4 185.1
97676767 147.2 143.6 132.6
99767564 167.8 178.3 162.1
21367989 232.3 238.1 216.9
From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is 168.5, a difference of about 27.The standard deviations are about 23, so this is a statistically significant difference.
Our prediction that the hash index would take on the average one
probe for about 10ms and the btree would take three probes for about 30 ms
or a difference of about 20ms was pretty well shown by the difference we
got of about 27. Hope these data points will help with some questions
about the performance differences between Hash and Btree index.
Regards,
Shreya
Gregory Stark <stark@enterprisedb.com> wrote: "Kenneth Marshall" writes:
On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
Using our implementation, build times and index sizes are
comparable with btree index build times and index sizes....
That is super! (and timely)
It is pretty super. I have a few comments to raise but don't take it to be too
negative, it sounds like this is a big step forward towards making hash
indexes valuable.
Firstly in the graphs it seems the index size graph has an exponential x-axis
but a linear y-axis. This makes it hard to see what I would expect to be
pretty much linear growth. The curves look exponential which would mean linear
growth but of course it's hard to tell.
Also, the growth in the time chart looks pretty much linear. That seems weird
since I would expect there would be a visible extra component since sort times
are n-log(n). Perhaps you need to test still larger tables to see that though.
In any case it's clear from the results you have there that the change is a
positive one and fixes a fundamental problem with the hash index build code.
Something else you should perhaps test is indexing something which is
substantially larger than hash function output. A hash function is going to
make the most sense when indexing something like strings for which you want to
avoid the long comparison costs. Especially consider testing this on a UTF8
locale with expensive comparisons (like a CJK locale for example).
Note that the bottom line for the problems with hash indexes is that the
current implementation doesn't offer any advantages over btree indexes. Hash
indexes need to be not only as good of a choice as btree indexes but
significantly better a significantly better choice at least for some specific
circumstances.
Also, if you're going to submit a patch you should check out a copy of the CVS
HEAD and work from that. I don't think there are huge differences in the area
of hash indexes though. But in most other areas you would be spending quite a
bit of time dealing details which have changed since.
Finally note that we're in the final throes of the 8.3 feature freeze.
Normally any patch submitted now would be held until 8.3 is released and
development on 8.4 is started. I could imagine an exception being made for
hash indexes since they're so moribund currently but probably not. The flip
side of that argument is that there's not much point in making an exception
for something which will only be really useful once further work is done in
the same area.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
I have not followed this thread very closely. But just wanted to give my inputs.
From the results obtained, the average of all the hash probes is 141.8ms,
the average for btree is 168.5, a difference of about 27.The standard
deviations are about 23, so this is a statistically significant difference.
Our prediction that the hash index would take on the average one
probe for about 10ms and the btree would take three probes for about 30 ms
or a difference of about 20ms was pretty well shown by the difference we
got of about 27. Hope these data points will help with some questions
about the performance differences between Hash and Btree index.
We all know that Hash indexes are good for equality queries and Binary
indexes are good for both equality queries and Range queries. So for
equality queries, Hash index should do only one Logical I/O (if no
hash collisions) and Binary index should do atleast 3 (I don't know
the level of B-tree that came out as a result of this). You can enable
the Logical I/O count by applying this patch.
*** postgresql-8.3beta1/src/backend/storage/buffer/bufmgr.c Tue Sep 25
18:11:48 2007
--- postgresql-8.3patch/src/backend/storage/buffer/bufmgr.c Fri Oct 19
23:18:36 2007
***************
*** 1470,1477 ****
localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
appendStringInfo(&str,
! "!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
! ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
appendStringInfo(&str,
"!\tLocal blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
ReadLocalBufferCount - LocalBufferHitCount,
LocalBufferFlushCount, localhitrate);
--- 1470,1477 ----
localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
appendStringInfo(&str,
! "!\tShared blocks: %10ld Logical Reads, %10ld Physical Reads, %10ld
written, buffer hit rate = %.2f%%\n",
! ReadBufferCount, ReadBufferCount - BufferHitCount,
BufferFlushCount, hitrate);
appendStringInfo(&str,
"!\tLocal blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
ReadLocalBufferCount - LocalBufferHitCount,
LocalBufferFlushCount, localhitrate);
If possible, it would be useful for you to present the reduction in
Logical I/O count, as it very well might translate to Physical I/Os
for simple index scans. Atleast check whether it is in the ratio 1:3.
Hope my suggestion helps.
Thanks,
Gokul.
CertoSQL Project,
Allied Solution Group.
(www.alliedgroups.com)
Shreya Bhargava wrote:
1. Populate the table with 80 million tuples.
2. Create HASH index on the table.
3. clear both linux cache & psql buffers.
(exiting psql and restarting it cleared the psql buffers;
to clear linux cache, we used drop_cache command)
4. start psql
5. select on an integer in the range of values in the table.
(all test numbers were big ones, like 98934599)
6. record the time.
7. exit psql.
8. drop caches.(as described above)
9. repeat 4-8 for different numbers.
10. Drop Hash index.
11. Create Btree index and repeat 3-9.
It seems you're mostly measuring the overhead of starting a backend,
populating the relcache etc.
Restarting psql doesn't clear the postgres shared buffer cache. Or did
you mean that you restarted postgres?
Anyway, I don't think it's interesting to test with cleared caches.
Surely the metapage and first 1-2 levels of the b-tree would stay cached
all the time in real life.
From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is 168.5, a difference of about 27.The standard deviations are about 23, so this is a statistically significant difference.
I don't trust those numbers much, but in any case I don't think that
edge is big enough to justify the existence of hash indexes.
If you're looking for a use case where hash index is faster, I'd suggest
using a data type with an expensive comparison function. Like long
multi-byte strings in UTF-8 encoding.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Shreya Bhargava wrote:
"Note that the bottom line for the problems with hash indexes is that the
current implementation doesn't offer any advantages over btree indexes. Hash
indexes need to be not only as good of a choice as btree indexes but
significantly better a significantly better choice at least for some
specific circumstances."
Oh it does. I recently used a hash index to speed up my database. Namely
I found it improved performance when indexing a non-integer column
containing english words.
I don't know how much of that data was cached, according to the sound
of my harddrive it wasn't all of it. Consider this anecdotical
evidence, but the speedup was noticeable.
We performed some probe tests on our patch on
hash index and the original btree index to compare the
performance between the two. We used a 80 million tuple table.
The table contained single integer attribute and the data
range was 0 - 100000000. (generated by a random generator).
I'd be interested how much difference is there between non-integer
index behaviour. I at least had the impression that in my case
the sorted strings in the btree pages didn't compare too well.
*/Gregory Stark <stark@enterprisedb.com>/* wrote:
"Kenneth Marshall" writes:On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
Using our implementation, build times and index sizes are
comparable with btree index build times and index sizes.
Way to go! Currently building hash indices is no fun.
Regards,
Jens-W. Schicke
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
iD8DBQFHMErXzhchXT4RR5ARAh7pAKCZIZFJFa7Oq25GvwDhiZJFsrtwgACbBC1F
otwhIZVlNgUGlroePIafi1c=
=N1f7
-----END PGP SIGNATURE-----
Shreya Bhargava <shreya_bhargav@yahoo.com> writes:
We performed some probe tests on our patch on
hash index and the original btree index to compare the
performance between the two.
Interesting, but ...
From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is 168.5, a difference of about 27.The standard deviations are about 23, so this is a statistically significant difference.
Our prediction that the hash index would take on the average one
probe for about 10ms and the btree would take three probes for about 30 ms
or a difference of about 20ms was pretty well shown by the difference we
got of about 27.
... unfortunately, a zero-cache situation isn't very reflective of the
real world. In reality, for an index that is getting used often enough
to impact application performance at all, the root page will stay in
cache and likely so will some of the upper tree levels. So I don't
think this experiment proves anything at all about real-world
performance.
What I'd like to see is an aggregate time measurement for millions of
random probes into an index that's noticeably larger than memory.
It would also be interesting to measure the speed of the fully-cached
case (same test, but index smaller than memory).
regards, tom lane
On Thu, Sep 13, 2007 at 02:02:14PM -0700, Neil Conway wrote:
On Fri, 2007-09-07 at 08:29 -0500, Kenneth Marshall wrote:
This is a great starting point. I would appreciate it if you have the
time and could make it apply cleanly to HEAD.Just to give you an update on this, I'll try to find the time to get it
done soon, but my day job is keeping me really busy these days, so I'm
not sure when I'll be able to get to it...-Neil
Neil,
I have been working on putting an updated version of your
patch into the current source. My first try was to try and
put your patch in directly, but it differed so much from the
current build that it was not obvious how to address things
like the current hash_index sorted build patch, which I need
to be able to test with indexes of any size at all. My current
try is to replace the _hash_formitem() calls with a function
called _hash_form_tuple() that actually returns an IndexTuple
and not an HashItem. This will allow it to be used quite
naturally with the current sorted build patch. Here is what
it looks like now:
/*
* _hash_form_tuple -- construct index tuple using hash(value) not value
*/
IndexTuple
_hash_form_tuple(IndexTuple itup, Relation rel)
{
IndexTuple result;
Size size;
uint32 hashkey;
Datum datum;
bool isnull;
/* disallow nulls in hash keys */
if (IndexTupleHasNulls(itup))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("hash indexes cannot contain null keys")
));
if (rel->rd_rel->relnatts != 1)
elog(ERROR, "hash indexes support only one index key");
/* hash the tuple; we only store the hash value in the index */
datum = index_getattr(itup, 1, RelationGetDescr(rel), &isnull);
Assert(!isnull);
hashkey = _hash_datum2hashkey(rel, datum);
size = IndexTupleSize(itup);
result = (IndexTuple) palloc(size);
memcpy(result, itup, size);
return result;
}
I am not currently doing anything other than returning the current
IndexTuple that was created with index_form_tuple(). Am I daft, or
can I just memcpy() the 6 bytes of TID, add the 2 bytes of t_info
(everything 0 and the size set to 6 + 2 + sizeof(hash) = 10), and
the 4 bytes of hash. This will allow me to handle 8-byte hashes
in the future. If you see a problem with this approach, please
let me know. I would appreciate any feedback you can give.
Regards,
Ken
Show quoted text
Import Notes
Reply to msg id not found: 1189717334.20524.0.camel@dell.linuxdev.us.dell.com