Doc chapter for Hash Indexes

Started by Simon Riggsalmost 5 years ago13 messageshackers
Jump to latest
#1Simon Riggs
simon@2ndQuadrant.com

New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

--
Simon Riggs http://www.EnterpriseDB.com/

Attachments:

doc_hash_index.v1.patchapplication/octet-stream; name=doc_hash_index.v1.patchDownload+163-0
#2Justin Pryzby
pryzby@telsasoft.com
In reply to: Simon Riggs (#1)
Re: Doc chapter for Hash Indexes

On Mon, Jun 21, 2021 at 02:08:12PM +0100, Simon Riggs wrote:

New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

+ <para>
+  PostgreSQL includes an implementation of persistent on-disk hash indexes,
+  which are now fully crash recoverable. Any data type can be indexed by a

I don't see any need to mention that they're "now" crash safe.

+  Each hash index tuple stores just the 4-byte hash value, not the actual
+  column value. As a result, hash indexes may be much smaller than B-trees
+  when indexing longer data items such as UUIDs, URLs etc.. The absence of

comma:
URLs, etc.

+  the column value also makes all hash index scans lossy. Hash indexes may
+  take part in bitmap index scans and backward scans.

Isn't it more correct to say that it must use a bitmap scan?

+  through the tree until the leaf page is found. In tables with millions
+  of rows this descent can increase access time to data. The equivalent

rows comma

+ that hash value. When scanning a hash bucket during queries we need to

queries comma

+ <para>
+  As a result of the overflow cases, we can say that hash indexes are
+  most suitable for unique, nearly unique data or data with a low number
+  of rows per hash bucket will be suitable for hash indexes. One

The beginning and end of the sentence duplicate "suitable".

+  Each row in the table indexed is represented by a single index tuple in
+  the hash index. Hash index tuples are stored in the bucket pages, and if
+  they exist, the overflow pages. 

"the overflow pages" didn't sound right, but I was confused by the comma.
I think it should say ".. in bucket pages and overflow pages, if any."

--
Justin

#3Amit Kapila
amit.kapila16@gmail.com
In reply to: Justin Pryzby (#2)
Re: Doc chapter for Hash Indexes

On Tue, Jun 22, 2021 at 4:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Jun 21, 2021 at 02:08:12PM +0100, Simon Riggs wrote:

New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

..

+  the column value also makes all hash index scans lossy. Hash indexes may
+  take part in bitmap index scans and backward scans.

Isn't it more correct to say that it must use a bitmap scan?

Why? Hash indexes do support regular index scan.

--
With Regards,
Amit Kapila.

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#1)
Re: Doc chapter for Hash Indexes

On Mon, Jun 21, 2021 at 6:38 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

Few comments
==============
1.
+  Hash indexes are best optimized for SELECTs and UPDATEs using equality
+  scans on larger tables.

Is there a reason to mention Selects and Updates but not Deletes?

2.
+  Like B-Trees, hash indexes perform simple index tuple deletion. This
+  is a deferred maintenance operation that deletes index tuples that are
+  known to be safe to delete (those whose item identifier's LP_DEAD bit
+  is already set). This is performed speculatively upon each insert,
+  though may not succeed if the page is pinned by another backend.

It is not very clear to me when we perform the simple index tuple
deletion from the above sentence. We perform it when there is no space
to accommodate a new tuple on the bucket page and as a result, we
might need to create an overflow page. Basically, I am not sure
saying: "This is performed speculatively upon each insert .." is
helpful.

3.
+  incrementally expanded.  When a new bucket is to be added to the index,
+  exactly one existing bucket will need to be "split", with some of its
+  tuples being transferred to the new bucket according to the updated
+  key-to-bucket-number mapping.  This is essentially the same hash table

In most places, the patch has used a single space after the full stop
but at some places like above, it has used two spaces after full stop.
I think it is better to be consistent.

4.
 This is essentially the same hash table
+  management technique embodied in src/backend/utils/hash/dynahash.c for
+  in-memory hash tables used within PostgreSQL internals.

I am not sure if there is a need to mention this in the user-facing
doc. I think README is a better place for this.

--
With Regards,
Amit Kapila.

#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#4)
Re: Doc chapter for Hash Indexes

On Tue, Jun 22, 2021 at 7:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 6:38 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

Few comments
==============
1.
+  Hash indexes are best optimized for SELECTs and UPDATEs using equality
+  scans on larger tables.

Is there a reason to mention Selects and Updates but not Deletes?

Deletes decrease the number of rows, so must eventually be matched with inserts.
So deletes imply inserts.

Perhaps it should say "update-heavy"

2.
+  Like B-Trees, hash indexes perform simple index tuple deletion. This
+  is a deferred maintenance operation that deletes index tuples that are
+  known to be safe to delete (those whose item identifier's LP_DEAD bit
+  is already set). This is performed speculatively upon each insert,
+  though may not succeed if the page is pinned by another backend.

It is not very clear to me when we perform the simple index tuple
deletion from the above sentence. We perform it when there is no space
to accommodate a new tuple on the bucket page and as a result, we
might need to create an overflow page. Basically, I am not sure
saying: "This is performed speculatively upon each insert .." is
helpful.

OK, thanks, will reword.

3.
+  incrementally expanded.  When a new bucket is to be added to the index,
+  exactly one existing bucket will need to be "split", with some of its
+  tuples being transferred to the new bucket according to the updated
+  key-to-bucket-number mapping.  This is essentially the same hash table

In most places, the patch has used a single space after the full stop
but at some places like above, it has used two spaces after full stop.
I think it is better to be consistent.

OK

4.
This is essentially the same hash table
+  management technique embodied in src/backend/utils/hash/dynahash.c for
+  in-memory hash tables used within PostgreSQL internals.

I am not sure if there is a need to mention this in the user-facing
doc. I think README is a better place for this.

OK, will remove. Thanks

I've reworded most things from both Amit and Justin; thanks for your reviews.

I attach both clean and compare versions.

--
Simon Riggs http://www.EnterpriseDB.com/

Attachments:

doc_hash_index.v1-v2.diffapplication/octet-stream; name=doc_hash_index.v1-v2.diffDownload+22-20
doc_hash_index.v2.patchapplication/octet-stream; name=doc_hash_index.v2.patchDownload+165-0
#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#5)
Re: Doc chapter for Hash Indexes

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

--
With Regards,
Amit Kapila.

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#6)
Re: Doc chapter for Hash Indexes

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

--
Simon Riggs http://www.EnterpriseDB.com/

#8Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#7)
Re: Doc chapter for Hash Indexes

aOn Wed, Jun 23, 2021 at 12:56:51PM +0100, Simon Riggs wrote:

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

Yeah, I think backpatching makes sense.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#8)
Re: Doc chapter for Hash Indexes

On Fri, Jun 25, 2021 at 1:29 AM Bruce Momjian <bruce@momjian.us> wrote:

aOn Wed, Jun 23, 2021 at 12:56:51PM +0100, Simon Riggs wrote:

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

Yeah, I think backpatching makes sense.

I checked and found that there are two commits (7c75ef5715 and
22c5e73562) in the hash index code in PG-11 which might have impacted
what we write in the documentation. However, AFAICS, nothing proposed
in the patch would change due to those commits. Even, if we don't want
to back patch, is there any harm in committing this to PG-14?

--
With Regards,
Amit Kapila.

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#9)
Re: Doc chapter for Hash Indexes

On Fri, Jun 25, 2021 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 25, 2021 at 1:29 AM Bruce Momjian <bruce@momjian.us> wrote:

aOn Wed, Jun 23, 2021 at 12:56:51PM +0100, Simon Riggs wrote:

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

Yeah, I think backpatching makes sense.

I checked and found that there are two commits (7c75ef5715 and
22c5e73562) in the hash index code in PG-11 which might have impacted
what we write in the documentation. However, AFAICS, nothing proposed
in the patch would change due to those commits. Even, if we don't want
to back patch, is there any harm in committing this to PG-14?

I've reviewed those commits and the related code, so I agree.

As a result, I've tweaked the wording around VACUUM slightly.

Clean and compare patches attached.

--
Simon Riggs http://www.EnterpriseDB.com/

Attachments:

doc_hash_index.v3.patchapplication/octet-stream; name=doc_hash_index.v3.patchDownload+167-0
doc_hash_index.v2-v3.patchapplication/octet-stream; name=doc_hash_index.v2-v3.patchDownload+5-3
#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#10)
Re: Doc chapter for Hash Indexes

On Fri, Jun 25, 2021 at 3:11 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

On Fri, Jun 25, 2021 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 25, 2021 at 1:29 AM Bruce Momjian <bruce@momjian.us> wrote:

aOn Wed, Jun 23, 2021 at 12:56:51PM +0100, Simon Riggs wrote:

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

Yeah, I think backpatching makes sense.

I checked and found that there are two commits (7c75ef5715 and
22c5e73562) in the hash index code in PG-11 which might have impacted
what we write in the documentation. However, AFAICS, nothing proposed
in the patch would change due to those commits. Even, if we don't want
to back patch, is there any harm in committing this to PG-14?

I've reviewed those commits and the related code, so I agree.

Do you agree to just commit this to PG-14 or to commit in PG-14 and
backpatch till PG-10?

As a result, I've tweaked the wording around VACUUM slightly.

Thanks, the changes look good to me.

--
With Regards,
Amit Kapila.

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#11)
Re: Doc chapter for Hash Indexes

On Sat, Jun 26, 2021 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 25, 2021 at 3:11 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

On Fri, Jun 25, 2021 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 25, 2021 at 1:29 AM Bruce Momjian <bruce@momjian.us> wrote:

aOn Wed, Jun 23, 2021 at 12:56:51PM +0100, Simon Riggs wrote:

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

Yeah, I think backpatching makes sense.

I checked and found that there are two commits (7c75ef5715 and
22c5e73562) in the hash index code in PG-11 which might have impacted
what we write in the documentation. However, AFAICS, nothing proposed
in the patch would change due to those commits. Even, if we don't want
to back patch, is there any harm in committing this to PG-14?

I've reviewed those commits and the related code, so I agree.

Do you agree to just commit this to PG-14 or to commit in PG-14 and
backpatch till PG-10?

I am planning to go through the patch once again and would like to
commit and backpatch till v10 in a day to two unless someone thinks
otherwise.

--
With Regards,
Amit Kapila.

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#12)
Re: Doc chapter for Hash Indexes

On Tue, Jun 29, 2021 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Jun 26, 2021 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I am planning to go through the patch once again and would like to
commit and backpatch till v10 in a day to two unless someone thinks
otherwise.

Pushed.

--
With Regards,
Amit Kapila.