Next Steps with Hash Indexes
Hi,
I've been investigating hash indexes and have what I think is a clear
picture in my head, so time for discussion.
It would be very desirable to allow Hash Indexes to become Primary Key
Indexes, which requires both
amroutine->amcanunique = true;
amroutine->amcanmulticol = true;
Every other hash index TODO seems like performance tuning, so it can
wait a while, even though it is tempting to do that first.
1. Multi-column Indexes
seems to have floundered because of this thread "Multicolumn Hash Indexes",
/messages/by-id/29263.1506483172@sss.pgh.pa.us,
but those issues don't apply in the most common cases, so they seem
like acceptable restrictions, especially since some of them already
apply to btrees etc.
(And noting that Hash indexes already assume strict hash operators, so
that is not an issue).
For me, this separates into two sub-concerns:
1.1 Allow multi-columns to be defined for hash indexes
Enabling this is a simple one-line patch
amroutine->amcanmulticol = true;
which works just fine on current HEAD without further changes (manual
testing, as yet).
If we do this first, then any work on uniqueness checking can take
into account multiple columns.
1.2 Combining multi-column hashes into one hash value
Trivially, this is already how it works, in the sense that we just use
the first column, however many columns there are in the index! Doing
more is an already solved problem in Postgres,
[TupleHashTableHash_internal() in src/backend/executor/execGrouping.c]
as pointed out here: "Combining hash values"
/messages/by-id/CAEepm=3rdgjfxW4cKvJ0OEmya2-34B0qHNG1xV0vK7TGPJGMUQ@mail.gmail.com
though noting there was no discussion on that point [1]. This just
needs a little refactoring to improve things, but it seems more of a
nice-to-have than an essential aspect of hash indexes, so it need not
block us from enabling multi-column hash indexes.
2. Unique Hash Indexes have been summarized here:
/messages/by-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA@mail.gmail.com
which also seems to have two parts to it.
2.1 Uniqueness Check
Amit: "to ensure that there is no duplicate entry we need to traverse
the whole bucket chain"
Agreed. That seems straightforward and can also be improved later.
2.2 Locking
Amit's idea of holding ExclusiveLock on the bucket page works for me,
but there was some doubt about splitting.
Splitting seems to be awful behavior that users would wish to avoid if
they knew about its impact and duration. In my understanding,
splitting involves moving 50% of rows and likely touches all blocks
eventually. If the existing index is the wrong shape then just run
REINDEX. If we tune the index build, it looks like REINDEX would be a
quicker and easier way of growing an index than trying to split an
existing one, i.e. rely on ecdysis rather than internal growth. This
is much more viable now because of the CIC changes in PG14.
(I would argue that removing splitting completely is a good idea,
similar to the way we removed the original VACUUM FULL algorithm,
though that thought will take a while to swallow.) Instead, I suggest
we introduce a new indexam option for hash indexes, autosplit=on
(default) | off, so that users can turn splitting off. Which means we
would probably also need another option, initial_buckets, where 0 (the
default) means size the index from the number of existing rows and N
means use that specific number of buckets. Note that turning off
splitting does not limit the size of the index; it just stops the
index from refreshing its number of buckets. B-trees are the default
for PKs, so hash indexes are an option for larger tables only, and
there is thus less need for hash indexes to cope with tables of
unknown size - we wouldn't even be using hash unless we already knew
it was a big table.
If splitting causes any difficulty at all, then we should simply say
that Unique Hash Indexes initially force autosplit=off, so we don't
have to worry about the correctness of the locking. I suggest we
implement that first and then decide whether we really care about
splitting, because I'd bet we don't. Yes, I consider uniqueness much
more desirable than splitting.
I've written a patch that refactors index build so we *never* need to
perform a split during index build, allowing us to more credibly skip
index splitting completely. (Incidentally, it avoids the need to
update the metapage for each row during the build, allowing us to
consider writing in batches to the index as a next step). So there
need be no *requirement* for splitting to be supported with
uniqueness, while build/reindex looks like it can be much faster. I
can post it if anyone wants to see it, but I don't want to distract us
from discussion of the main requirements.
I have other performance tuning ideas, but they can wait.
Anyway, that's what I think at present. Thoughts?
--
Simon Riggs http://www.EnterpriseDB.com/
On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
2. Unique Hash Indexes have been summarized here:
/messages/by-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA@mail.gmail.com
which also seems to have two parts to it.
2.1 Uniqueness Check
Amit: "to ensure that there is no duplicate entry we need to traverse
the whole bucket chain"
Agreed. That seems straightforward and can also be improved later.
2.2 Locking
Amit's idea of holding ExclusiveLock on the bucket page works for me,
but there was some doubt about splitting.
I think the main thing to think about for the uniqueness check during
a split (where we scan both the old and new buckets) is whether we
need to lock both the old (bucket_being_split) and new
(bucket_being_populated) buckets, or whether holding a lock on just
one of them (the current bucket in which we are inserting) is
sufficient. During a scan of the new bucket, we just retain pins on
both buckets (see comments in _hash_first()), but if we need to retain
locks on both buckets then we need to do something different than we
do for scans. But I think it is sufficient to hold an exclusive lock
on the primary bucket page of the bucket we are trying to insert into,
and a pin on the other bucket (the old bucket, as we do for scans),
because no concurrent inserter should try to insert the same tuple
into both the old and new buckets: before starting the split we always
update the metapage's hashm_lowmask and hashm_highmask, which decide
the routing of the tuples.
Now, the other problem we need to think about here is that for a hash
index, after finding a tuple in the index, we always need to recheck
in the heap, as we don't store the actual value in the hash index. For
a scan, we get the tuple(s) from the index, release the locks, and
then match the qual after fetching the tuple from the heap. But we
can't do that for the uniqueness check, because if we release the
locks on the index bucket page then another inserter could come in
before we match it in the heap. I think we need some mechanism
whereby, after fetching the TID from the index, we recheck the actual
value in the heap before releasing the lock on the index bucket page.
The other thing could be that if we have unique support for hash
indexes then we can probably allow INSERT ... ON CONFLICT if the user
specifies a unique index column as the conflict_target.
I am not sure if multicol index support is mandatory to allow
uniqueness for hash indexes; sure, it would be good, but I feel that
can be done as a separate patch as well.
--
With Regards,
Amit Kapila.
On Tue, Jul 20, 2021 at 5:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
2. Unique Hash Indexes have been summarized here:
/messages/by-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA@mail.gmail.com
which also seems to have two parts to it.
2.1 Uniqueness Check
Amit: "to ensure that there is no duplicate entry we need to traverse
the whole bucket chain"
Agreed. That seems straightforward and can also be improved later.
2.2 Locking
Amit's idea of holding ExclusiveLock on the bucket page works for me,
but there was some doubt about splitting.
I think the main thing to think about for uniqueness check during
split (where we scan both the old and new buckets) was whether we need
to lock both the old (bucket_being_split) and new
(bucket_being_populated) buckets or just holding locks on one of them
(the current bucket in which we are inserting) is sufficient? During a
scan of the new bucket, we just retain pins on both the buckets (see
comments in _hash_first()) but if we need to retain locks on both
buckets then we need to do something different then we do it for
scans. But, I think it is sufficient to just hold an exclusive lock on
the primary bucket page in the bucket we are trying to insert and pin
on the other bucket (old bucket as we do for scans). Because no
concurrent inserter should try to insert into the old bucket and new
bucket the same tuple as before starting the split we always update
the metapage for hashm_lowmask and hashm_highmask which decides the
routing of the tuples.
Now, I think here the other problem we need to think about is that for
the hash index after finding the tuple in the index, we need to always
recheck in the heap as we don't store the actual value in the hash
index. For that in the scan, we get the tuple(s) from the index
(release locks) and then match qual after fetching tuple from the
heap. But we can't do that for uniqueness check because if we release
the locks on the index bucket page then another inserter could come
before we match it in heap. I think we need some mechanism that after
fetching TID from the index, we recheck the actual value in heap
before releasing the lock on the index bucket page.
One more thing we need to think about here is when to find the right
bucket page in the chain where we can insert the new tuple. Do we
first complete the uniqueness check (which needs to scan through the
entire bucket chain) and then scan the bucket again for a page with
space to insert, or do we want to do it along with the
uniqueness-check scan and remember the location?
--
With Regards,
Amit Kapila.
On Tue, Jul 20, 2021 at 1:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
2. Unique Hash Indexes have been summarized here:
/messages/by-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA@mail.gmail.com
which also seems to have two parts to it.
2.1 Uniqueness Check
Amit: "to ensure that there is no duplicate entry we need to traverse
the whole bucket chain"
Agreed. That seems straightforward and can also be improved later.
2.2 Locking
Amit's idea of holding ExclusiveLock on the bucket page works for me,
but there was some doubt about splitting.
I think the main thing to think about for uniqueness check during
split (where we scan both the old and new buckets) was whether we need
to lock both the old (bucket_being_split) and new
(bucket_being_populated) buckets or just holding locks on one of them
(the current bucket in which we are inserting) is sufficient? During a
scan of the new bucket, we just retain pins on both the buckets (see
comments in _hash_first()) but if we need to retain locks on both
buckets then we need to do something different then we do it for
scans. But, I think it is sufficient to just hold an exclusive lock on
the primary bucket page in the bucket we are trying to insert and pin
on the other bucket (old bucket as we do for scans). Because no
concurrent inserter should try to insert into the old bucket and new
bucket the same tuple as before starting the split we always update
the metapage for hashm_lowmask and hashm_highmask which decides the
routing of the tuples.
During an incomplete split, we need to scan both old and new. So
during insert, we need to scan both old and new, while holding
exclusive locks on both bucket pages. I've spent a few days looking at
the split behavior and this seems a complete approach. I'm working on
a patch now, still at hacking stage.
(BTW, my opinion of the split mechanism has now changed from bad to
very good. It works really well for unique data, but can be completely
ineffective for badly skewed data).
Now, I think here the other problem we need to think about is that for
the hash index after finding the tuple in the index, we need to always
recheck in the heap as we don't store the actual value in the hash
index. For that in the scan, we get the tuple(s) from the index
(release locks) and then match qual after fetching tuple from the
heap. But we can't do that for uniqueness check because if we release
the locks on the index bucket page then another inserter could come
before we match it in heap. I think we need some mechanism that after
fetching TID from the index, we recheck the actual value in heap
before releasing the lock on the index bucket page.
I don't think btree does that, so I'm not sure we do need that for
hash. Yes, there is a race condition, but someone will win. Do we care
who? Do we care enough to take the concurrency hit? Duplicate inserts
would be very rare in a declared unique index, so it would be a poor
trade-off.
The other thing could be that if we have unique support for hash index
then probably we can allow Insert ... ON Conflict if the user
specifies unique index column as conflict_target.
Yes, that looks doable.
I am not sure if multicol index support is mandatory to allow
uniqueness for hash indexes, sure it would be good but I feel that can
be done as a separate patch as well.
I have a patch for multicol support, attached.
--
Simon Riggs http://www.EnterpriseDB.com/
Attachments:
hash_multicol.v2.patch (application/octet-stream, +127 -21)
On Tue, Jul 20, 2021 at 1:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
One more thing we need to think about here is when to find the right
bucket page in the chain where we can insert the new tuple. Do we
first try to complete the uniqueness check (which needs to scan
through the entire bucket chain) and then again scan the bucket with
space to insert or do we want to do it along with uniqueness check
scan and remember it?
The latter approach, but that is just a performance tweak for later.
On a unique hash index, regular splitting means there are almost no
bucket chains more than 2 long (bucket plus overflow), so it seems
like mostly wasted effort at this point.
--
Simon Riggs http://www.EnterpriseDB.com/
On Tue, Jul 20, 2021 at 6:32 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
On Tue, Jul 20, 2021 at 1:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
I think the main thing to think about for uniqueness check during
split (where we scan both the old and new buckets) was whether we need
to lock both the old (bucket_being_split) and new
(bucket_being_populated) buckets or just holding locks on one of them
(the current bucket in which we are inserting) is sufficient? During a
scan of the new bucket, we just retain pins on both the buckets (see
comments in _hash_first()) but if we need to retain locks on both
buckets then we need to do something different then we do it for
scans. But, I think it is sufficient to just hold an exclusive lock on
the primary bucket page in the bucket we are trying to insert and pin
on the other bucket (old bucket as we do for scans). Because no
concurrent inserter should try to insert into the old bucket and new
bucket the same tuple as before starting the split we always update
the metapage for hashm_lowmask and hashm_highmask which decides the
routing of the tuples.
During an incomplete split, we need to scan both old and new. So
during insert, we need to scan both old and new, while holding
exclusive locks on both bucket pages.
It will surely work if we have an exclusive lock on both buckets (old
and new) in this case, but I think it is better if we can avoid
exclusively locking the old bucket (bucket_being_split) unless it is
really required. We need an exclusive lock on the primary bucket where
we are trying to insert, to avoid any other inserter with the same
key, but I think we don't need it for the old bucket, because no
inserter with the same key can try to insert into the old bucket a key
which would belong to the new bucket.
I've spent a few days looking at
the split behavior and this seems a complete approach. I'm working on
a patch now, still at hacking stage.
(BTW, my opinion of the split mechanism has now changed from bad to
very good. It works really well for unique data, but can be completely
ineffective for badly skewed data).
Now, I think here the other problem we need to think about is that for
the hash index after finding the tuple in the index, we need to always
recheck in the heap as we don't store the actual value in the hash
index. For that in the scan, we get the tuple(s) from the index
(release locks) and then match qual after fetching tuple from the
heap. But we can't do that for uniqueness check because if we release
the locks on the index bucket page then another inserter could come
before we match it in heap. I think we need some mechanism that after
fetching TID from the index, we recheck the actual value in heap
before releasing the lock on the index bucket page.
I don't think btree does that, so I'm not sure we do need that for
hash. Yes, there is a race condition, but someone will win. Do we care
who? Do we care enough to take the concurrency hit?
I think if we don't care, we might end up with duplicates. I might be
missing something here, but let me explain the problem I see. Say,
while doing a unique check, we find the same hash value in the bucket
we are trying to insert into; we can't raise a unique-key violation at
this stage and error out without checking the actual value in the
heap, because there is always a chance that two different key values
map to the same hash value. Now, suppose after checking in the heap we
find that the actual value doesn't match, so we decide to insert the
value into the hash index; in the meantime, another insert of the same
key value has already performed these checks and ends up inserting the
value into the hash index, which would leave a duplicate value in the
hash index. I think btree doesn't have a similar risk, so we don't
need such a mechanism for btree.
--
With Regards,
Amit Kapila.
On Thu, 22 Jul 2021 at 06:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
It will surely work if we have an exclusive lock on both the buckets
(old and new) in this case but I think it is better if we can avoid
exclusive locking the old bucket (bucket_being_split) unless it is
really required. We need an exclusive lock on the primary bucket where
we are trying to insert to avoid any other inserter with the same key
but I think we don't need it for the old bucket because no inserter
with the same key can try to insert the key in an old bucket which
would belong to the new bucket.
Agreed.
I don't think btree does that, so I'm not sure we do need that for
hash. Yes, there is a race condition, but someone will win. Do we care
who? Do we care enough to take the concurrency hit?
I think if we don't care we might end up with duplicates. I might be
missing something here but let me explain the problem I see. Say,
while doing a unique check we found the same hash value in the bucket
we are trying to insert, we can't say unique key violation at this
stage and error out without checking the actual value in heap. This is
because there is always a chance that two different key values can map
to the same hash value. Now, after checking in the heap if we found
that the actual value doesn't match so we decide to insert the value
in the hash index, and in the meantime, another insert of the same key
value already performed these checks and ends up inserting the value
in hash index and that would lead to a duplicate value in the hash
index. I think btree doesn't have a similar risk so we don't need such
a mechanism for btree.
Agreed, after thinking about it more while coding.
All of the above implemented in the patches below:
Complete patch for hash_multicol.v3.patch attached, slightly updated
from earlier patch.
Docs, tests, passes make check.
WIP for hash_unique.v4.patch attached, patch-on-patch, to allow us to
discuss flow of logic and shape of code.
The actual comparison is not implemented yet. Not trivial, but it can
wait until we settle the main logic.
Passes make check and executes attached tests.
Tests in separate file also attached, will eventually be merged into
src/test/regress/sql/hash_index.sql
No tests yet for splitting or deferred uniqueness checks. The latter
is because there are parse analysis changes needed to undo the
assumption that only btrees support uniqueness, but nothing there
looks like an issue.
Thanks for your input, have a good weekend.
--
Simon Riggs http://www.EnterpriseDB.com/
On Fri, Jul 23, 2021 at 6:16 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
On Thu, 22 Jul 2021 at 06:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
Complete patch for hash_multicol.v3.patch attached, slightly updated
from earlier patch.
Docs, tests, passes make check.
I was looking into the hash_multicol.v3.patch, I have a question
<para>
- Hash indexes support only single-column indexes and do not allow
- uniqueness checking.
+ Hash indexes support uniqueness checking.
+ Hash indexes support multi-column indexes, but only store the hash value
+ for the first column, so multiple columns are useful only for uniqueness
+ checking.
</para>
The above comments say that we store hash value only for the first
column, my question is why don't we store for other columns as well?
I mean we can search the bucket based on the first column hash but the
hashes for the other columns could be payload data, and we could use
them to match the hash values for the other key columns before
accessing the heap, as discussed here [1]. IMHO, this will further
reduce heap access.
[1]: /messages/by-id/7192.1506527843@sss.pgh.pa.us
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 15, 2021 at 9:41 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
It would be very desirable to allow Hash Indexes to become Primary Key
Indexes, which requires both
amroutine->amcanunique = true;
amroutine->amcanmulticol = true;
Why do you say that? I don't think it's self-evident that it's desirable.
In general I don't think that hash indexes are all that compelling
compared to B-Trees. In practice the flexibility of B-Trees tends to
win out, even if B-Trees are slightly slower than hash indexes with
certain kinds of benchmarks that are heavy on point lookups and have
no locality.
I have no reason to object to any of this, and I don't object. I'm just asking.
--
Peter Geoghegan
On Tue, Aug 10, 2021 at 6:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jul 23, 2021 at 6:16 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
On Thu, 22 Jul 2021 at 06:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
Complete patch for hash_multicol.v3.patch attached, slightly updated
from earlier patch.
Docs, tests, passes make check.
I was looking into the hash_multicol.v3.patch, I have a question
<para>
- Hash indexes support only single-column indexes and do not allow
- uniqueness checking.
+ Hash indexes support uniqueness checking.
+ Hash indexes support multi-column indexes, but only store the hash value
+ for the first column, so multiple columns are useful only for uniqueness
+ checking.
</para>
The above comments say that we store hash value only for the first
column, my question is why don't we store for other columns as well?
I mean we can search the bucket based on the first column hash but the
hashes for the other column could be payload data and we can use that
to match the hash value for other key columns before accessing the
heap, as discussed here[1]. IMHO, this will further reduce the heap
access.
True; the other idea could be that in the payload we store the value
after 'combining multi-column hashes into one hash value'. This would
allow us to efficiently satisfy queries where the search is on all
columns of the index, provided the planner doesn't remove some of
them, in which case we need to do more work.
One more thing we need to consider is 'hashm_procid' stored in the
meta page; currently it works for a single-column index, but for a
multi-column index we might want to set it to InvalidOid.
--
With Regards,
Amit Kapila.
On Tue, Aug 10, 2021 at 8:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I was looking into the hash_multicol.v3.patch, I have a question
<para>
- Hash indexes support only single-column indexes and do not allow
- uniqueness checking.
+ Hash indexes support uniqueness checking.
+ Hash indexes support multi-column indexes, but only store the hash value
+ for the first column, so multiple columns are useful only for uniqueness
+ checking.
</para>
The above comments say that we store hash value only for the first
column, my question is why don't we store for other columns as well?
I suspect it would be hard to store multiple hash values, one per
column. It seems to me that what we ought to do is combine the hash
values for the individual columns using hash_combine(64) and store the
combined value. I can't really imagine why we would NOT do that. It
seems like it would be easy to do and make the behavior of the feature
way less surprising.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
I suspect it would be hard to store multiple hash values, one per
column. It seems to me that what we ought to do is combine the hash
values for the individual columns using hash_combine(64) and store the
combined value. I can't really imagine why we would NOT do that.
That would make it impossible to use the index except with queries
that provide equality conditions on all the index columns. Maybe
that's fine, but it seems less flexible than other possible definitions.
It really makes me wonder why anyone would bother with a multicol
hash index.
regards, tom lane
On Wed, Aug 11, 2021 at 10:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I suspect it would be hard to store multiple hash values, one per
column. It seems to me that what we ought to do is combine the hash
values for the individual columns using hash_combine(64) and store the
combined value. I can't really imagine why we would NOT do that.
That would make it impossible to use the index except with queries
that provide equality conditions on all the index columns. Maybe
that's fine, but it seems less flexible than other possible definitions.
It really makes me wonder why anyone would bother with a multicol
hash index.
Hmm. That is a point I hadn't considered.
I have to admit that after working with Amit on all the work to make
hash indexes WAL-logged a few years ago, I was somewhat disillusioned
with the whole AM. It seems like a cool idea to me but it's just not
that well-implemented. For example, the strategy of just doubling the
number of buckets in one shot seems pretty terrible for large indexes,
and ea69a0dead5128c421140dc53fac165ba4af8520 will buy only a limited
amount of relief. Likewise, the fact that keys are stored in hash
value order within pages but that the bucket as a whole is not kept in
order seems like it's bad for search performance and really bad for
implementing unique indexes with reasonable amounts of locking. (I
don't know how the present patch tries to solve that problem.) It's
tempting to think that we should think about creating something
altogether new instead of hacking on the existing implementation, but
that's a lot of work and I'm not sure what specific design would be
best.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
I have to admit that after working with Amit on all the work to make
hash indexes WAL-logged a few years ago, I was somewhat disillusioned
with the whole AM. It seems like a cool idea to me but it's just not
that well-implemented.
Yeah, agreed. The whole buckets-are-integral-numbers-of-pages scheme
is pretty well designed to ensure bloat, but trying to ameliorate that
by reducing the number of buckets creates its own problems (since, as
you mention, we have no scheme whatever for searching within a bucket).
I'm quite unimpressed with Simon's upthread proposal to turn off bucket
splitting without doing anything about the latter issue.
I feel like we'd be best off to burn the AM to the ground and start
over. I do not know what a better design would look like exactly,
but I feel like it's got to decouple buckets from pages somehow.
Along the way, I'd want to store 64-bit hash values (we still haven't
done that, have we?).
As far as the specific point at hand is concerned, I think storing
a hash value per index column, while using only the first column's
hash for bucket selection, is what to do for multicol indexes.
We still couldn't set amoptionalkey=true for hash indexes, because
without a hash for the first column we don't know which bucket to
look in. But storing hashes for the additional columns would allow
us to check additional conditions in the index, and usually save
trips to the heap on queries that provide additional column
conditions. You could also imagine sorting the contents of a bucket
on all the hashes, which would ease uniqueness checks.
regards, tom lane
On Wed, Aug 11, 2021 at 10:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
don't know how the present patch tries to solve that problem.) It's
tempting to think that we should think about creating something
altogether new instead of hacking on the existing implementation, but
that's a lot of work and I'm not sure what specific design would be
best.
(Standard disclaimer that I'm not qualified to design index AMs) I've seen
one mention in the literature about the possibility of simply having a
btree index over the hash values. That would require faster search within
pages, in particular using abbreviated keys in the ItemId array of internal
pages [1] and interpolated search rather than pure binary search (which
should work reliably with high-entropy keys like hash values), but doing
that would speed up all btree indexes, so that much is worth doing
regardless of how hash indexes are implemented. In that scheme, the hash
index AM would just be around for backward compatibility.
[1]: https://wiki.postgresql.org/wiki/Key_normalization#Optimizations_enabled_by_key_normalization
--
John Naylor
EDB: http://www.enterprisedb.com
John Naylor <john.naylor@enterprisedb.com> writes:
(Standard disclaimer that I'm not qualified to design index AMs) I've seen
one mention in the literature about the possibility of simply having a
btree index over the hash values.
Yeah, that's been talked about in the past --- we considered it
moderately seriously back when the hash AM was really only a toy
for lack of WAL support. The main knock on it is that searching
a btree is necessarily O(log N), while in principle a hash probe
could be O(1). Of course, a badly-implemented hash AM could be
worse than O(1), but we'd basically be giving up on ever getting
to O(1).
There's a separate discussion to be had about whether there should
be an easier way for users to build indexes that are btrees of
hashes. You can do it today but the indexes aren't pleasant to
use, requiring query adjustment.
regards, tom lane
On Wed, Aug 11, 2021 at 11:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, agreed. The whole buckets-are-integral-numbers-of-pages scheme
is pretty well designed to ensure bloat, but trying to ameliorate that
by reducing the number of buckets creates its own problems (since, as
you mention, we have no scheme whatever for searching within a bucket).
I'm quite unimpressed with Simon's upthread proposal to turn off bucket
splitting without doing anything about the latter issue.
Maybe. I don't think that it should be a huge problem to decide that
an occupied bucket has to consume an entire page; if that's a big
issue, you should just have fewer buckets. I do think it's a problem
that a bucket containing no tuples at all still consumes an entire
page, because the data is often going to be skewed so that many
buckets are entirely empty. I also think it's a problem that expanding
the directory from 2^{N} buckets to 2^{N+1} buckets requires 4
allocations of 2^{N-2} *consecutive* pages. That's a lot of
consecutive pages for even fairly modest values of N.
Imagine a design where we have a single-page directory with 1024
slots, each corresponding to one bucket. Each slot stores a block
number, which might be InvalidBlockNumber if there are no tuples in
that bucket. Given a tuple with hash value H, check slot H%1024 and
then go to that page to look further. If there are more tuples in that
bucket than can fit on the page, then it can link to another page. If
we assume for the sake of argument that 1024 is the right number of
buckets, this is going to use about as much space as the current
system when the data distribution is uniform, but considerably less
when it's skewed. The larger you make the number of buckets, the
better this kind of thing looks on skewed data.
Now, you can't just always have 1024 buckets, so we'd actually have to
do something a bit more clever, probably involving multiple levels of
directories. For example, suppose a directory page contains only 32
slots. That will leave a lot of empty space in the page, which can be
used to store tuples. An index search has to scan all tuples that are
stored directly in the page, and also use the first 5 bits of the hash
key to search the appropriate bucket. But the bucket is itself a
directory: it can have some tuples stored directly in the page, and
then it has 32 more slots and you use the next 5 bits of the hash key
to decide which one of those to search. Then it becomes possible to
incrementally expand the hash index: when the space available in a
directory page fills up, you can either create a sub-directory and
move as many tuples as you can into that page, or add an overflow page
that contains only tuples.
It's important to be able to do either one, because sometimes a bucket
fills up with tuples that have identical hash values, and sometimes a
bucket fills up with tuples that have a variety of hash values. The
current implementation tends to massively increase the number of
buckets even when it does very little to spread the index entries out.
("Hey, I doubled the number of buckets and the keys are still almost
all in one bucket ... let me double the number of buckets again and
see if it works better this time!") If we were going to create a
replacement, we'd want the index to respond differently to a bunch of
dups vs. a bunch of non-dups.
As far as the specific point at hand is concerned, I think storing
a hash value per index column, while using only the first column's
hash for bucket selection, is what to do for multicol indexes.
We still couldn't set amoptionalkey=true for hash indexes, because
without a hash for the first column we don't know which bucket to
look in. But storing hashes for the additional columns would allow
us to check additional conditions in the index, and usually save
trips to the heap on queries that provide additional column
conditions. You could also imagine sorting the contents of a bucket
on all the hashes, which would ease uniqueness checks.
That sounds reasonable I guess, but I don't know how hard it is to implement.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
As far as the specific point at hand is concerned, I think storing
a hash value per index column, while using only the first column's
hash for bucket selection, is what to do for multicol indexes.
We still couldn't set amoptionalkey=true for hash indexes, because
without a hash for the first column we don't know which bucket to
look in. But storing hashes for the additional columns would allow
us to check additional conditions in the index, and usually save
trips to the heap on queries that provide additional column
conditions. You could also imagine sorting the contents of a bucket
on all the hashes, which would ease uniqueness checks.
Earlier, I was thinking that we would have two hashes: one for the
first key column, used to identify the bucket, and one for the
remaining key columns, which would further help with heap lookups and
with ordering for uniqueness checking. But yeah, if we have a hash
value for each column then that will make it really flexible.
I was also looking into how other databases support hash indexes, and
I see that at least in MySQL [1] a multi-column index has the
limitation that you have to give all key columns in the search for
the index scan to be selected. IMHO, that limitation might be there
because they are storing just one hash value based on all key columns
and also selecting the bucket based on the same hash value.
[1]: https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 12, 2021 at 9:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
As far as the specific point at hand is concerned, I think storing
a hash value per index column, while using only the first column's
hash for bucket selection, is what to do for multicol indexes.
We still couldn't set amoptionalkey=true for hash indexes, because
without a hash for the first column we don't know which bucket to
look in. But storing hashes for the additional columns would allow
us to check additional conditions in the index, and usually save
trips to the heap on queries that provide additional column
conditions.
Yeah, this sounds reasonable but I think the alternative proposal by
Dilip (see below) and me [1] also has merits.
You could also imagine sorting the contents of a bucket
on all the hashes, which would ease uniqueness checks.
Yeah, we can do that, but the current design also seems to have merits
for uniqueness checks. To sort all the hashes in the bucket, we would
need to read all the overflow pages and then sort them, which could
lead to additional I/O in some cases. The other possibility is to keep
all the bucket pages sorted during insertion, but that would also
require additional I/O. OTOH, in the current design, only if the value
is not found in the current bucket page (which has hash values in
sorted order) do we move to the next page.
Earlier, I was thinking that we would have two hashes: one for the
first key column, used to identify the bucket, and one for the
remaining key columns, which would further help with heap lookups and
with ordering for uniqueness checking.
I also mentioned an almost identical idea yesterday [1]. If we go
with a specification similar to MySQL and SQL Server then that would
probably be better than storing the hashes for all the columns.
But yeah, if we have a hash value for each column
then that will make it really flexible.
I was also looking into how other databases support hash indexes, and
I see that at least in MySQL [1] a multi-column index has the
limitation that you have to give all key columns in the search for
the index scan to be selected.
I see that SQL Server also has the same specification for multi-column
hash indexes [2]. See the "Multi-column index" section. So it might not
be a bad idea to have a similar specification.
[1]: /messages/by-id/CAA4eK1JD1=nPDi0kDPGLC+JDGEYP8DgTanobvgve++KniQ68TA@mail.gmail.com
[2]: https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15
--
With Regards,
Amit Kapila.
On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I have to admit that after working with Amit on all the work to make
hash indexes WAL-logged a few years ago, I was somewhat disillusioned
with the whole AM. It seems like a cool idea to me but it's just not
that well-implemented.
Yeah, agreed. The whole buckets-are-integral-numbers-of-pages scheme
is pretty well designed to ensure bloat, but trying to ameliorate that
by reducing the number of buckets creates its own problems (since, as
you mention, we have no scheme whatever for searching within a bucket).
I'm quite unimpressed with Simon's upthread proposal to turn off bucket
splitting without doing anything about the latter issue.
The design of the patch has changed since the initial proposal. It
tries to perform unique inserts by holding a write lock on the bucket
page to avoid duplicate inserts.
--
With Regards,
Amit Kapila.