Doc chapter for Hash Indexes

Started by Simon Riggs over 4 years ago · 13 messages
#1 Simon Riggs
simon.riggs@enterprisedb.com
1 attachment(s)

New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

--
Simon Riggs http://www.EnterpriseDB.com/

Attachments:

doc_hash_index.v1.patch (application/octet-stream)
commit 487fbecb3411a9eb0559af856cc9779f242cdc3e
Author: Simon Riggs <simon.riggs@enterprisedb.com>
Date:   Mon Jun 21 13:57:27 2021 +0100

    New doc chapter for Hash Indexes

diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 45b701426b..459add557b 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -89,6 +89,7 @@
 <!ENTITY spgist     SYSTEM "spgist.sgml">
 <!ENTITY gin        SYSTEM "gin.sgml">
 <!ENTITY brin       SYSTEM "brin.sgml">
+<!ENTITY hash       SYSTEM "hash.sgml">
 <!ENTITY planstats    SYSTEM "planstats.sgml">
 <!ENTITY tableam    SYSTEM "tableam.sgml">
 <!ENTITY indexam    SYSTEM "indexam.sgml">
diff --git a/doc/src/sgml/hash.sgml b/doc/src/sgml/hash.sgml
new file mode 100644
index 0000000000..6830b384bf
--- /dev/null
+++ b/doc/src/sgml/hash.sgml
@@ -0,0 +1,161 @@
+<!-- doc/src/sgml/hash.sgml -->
+
+<chapter id="hash-index">
+<title>Hash Indexes</title>
+
+   <indexterm>
+    <primary>index</primary>
+    <secondary>Hash</secondary>
+   </indexterm>
+
+<sect1 id="hash-intro">
+ <title>Overview</title>
+
+ <para>
+  PostgreSQL includes an implementation of persistent on-disk hash indexes,
+  which are now fully crash recoverable. Any data type can be indexed by a
+  hash index, including data types that do not have a well-defined linear
+  ordering. Hash indexes store only the hash value of the data being
+  indexed, thus there are no restrictions on the size of the data column
+  being indexed.
+ </para>
+
+ <para>
+  Hash indexes support only single-column indexes and do not allow
+  uniqueness checking.
+ </para>
+
+ <para>
+  Hash indexes support only the = operator, so WHERE clauses that specify
+  range operations will not be able to take advantage of hash indexes.
+ </para>
+
+ <para>
+  Each hash index tuple stores just the 4-byte hash value, not the actual
+  column value. As a result, hash indexes may be much smaller than B-trees
+  when indexing longer data items such as UUIDs, URLs etc.. The absence of
+  the column value also makes all hash index scans lossy. Hash indexes may
+  take part in bitmap index scans and backward scans.
+ </para>
+
+ <para>
+  Hash indexes are best optimized for SELECTs and UPDATEs using equality
+  scans on larger tables. In a B-tree index, searches must descend
+  through the tree until the leaf page is found. In tables with millions
+  of rows this descent can increase access time to data. The equivalent
+  of a leaf page in a hash index is referred to as a bucket page. In
+  contrast, a hash index allows accessing the bucket pages directly,
+  thereby potentially reducing index access time in larger tables. This
+  reduction in "logical I/O" becomes even more pronounced on indexes/data
+  larger than shared_buffers/RAM. 
+ </para>
+
+ <para>
+  Hash indexes have been designed to cope with uneven distributions of
+  hash values. Direct access to the bucket pages works well if the hash
+  values are evenly distributed. When inserts mean that the bucket page
+  becomes full, additional overflow pages are chained to that specific
+  bucket page, locally expanding the storage for index tuples that match
+  that hash value. When scanning a hash bucket during queries we need to
+  scan through all of the overflow pages. Thus an unbalanced hash index
+  might actually be worse than a B-tree in terms of number of block
+  accesses required, for some data.
+ </para>
+
+ <para>
+  As a result of the overflow cases, we can say that hash indexes are
+  most suitable for unique, nearly unique data or data with a low number
+  of rows per hash bucket will be suitable for hash indexes. One
+  possible way to avoid problems is to exclude highly non-unique values
+  from the index using a partial index condition, but this may not be
+  suitable in many cases.
+ </para>
+
+ <para>
+  Like B-Trees, hash indexes perform simple index tuple deletion. This
+  is a deferred maintenance operation that deletes index tuples that are
+  known to be safe to delete (those whose item identifier's LP_DEAD bit
+  is already set). This is performed speculatively upon each insert,
+  though may not succeed if the page is pinned by another backend.
+  Deletion of dead index pointers also occurs during VACUUM.  If an
+  overflow page becomes empty, overflow pages can be recycled for reuse
+  in other buckets, though we never return them to the operating system.
+ </para>
+
+ <para>
+  There is currently no provision to shrink a hash index, other than by
+  rebuilding it with REINDEX.  
+  There is no provision for reducing the number of buckets, either.
+ </para>
+
+ <para>
+  Hash indexes may expand the number of bucket pages as the number of
+  rows indexed grows.
+  The hash key-to-bucket-number mapping is chosen so that the index can be
+  incrementally expanded.  When a new bucket is to be added to the index,
+  exactly one existing bucket will need to be "split", with some of its
+  tuples being transferred to the new bucket according to the updated
+  key-to-bucket-number mapping.  This is essentially the same hash table
+  management technique embodied in src/backend/utils/hash/dynahash.c for
+  in-memory hash tables used within PostgreSQL internals.
+ </para>
+
+ <para>
+  The expansion occurs in the foreground, which could increase execution
+  time for user inserts. Thus, hash indexes may not be suitable for tables
+  with a rapidly increasing number of rows.
+ </para>
+
+</sect1>
+
+<sect1 id="hash-implementation">
+ <title>Implementation</title>
+
+ <para>
+  There are four kinds of pages in a hash index: the meta page (page zero),
+  which contains statically allocated control information; primary bucket
+  pages; overflow pages; and bitmap pages, which keep track of overflow
+  pages that have been freed and are available for re-use.  For addressing
+  purposes, bitmap pages are regarded as a subset of the overflow pages.
+ </para>
+
+ <para>
+  Both scanning the index and inserting tuples require locating the bucket
+  where a given tuple ought to be located.  To do this, we need the bucket
+  count, highmask, and lowmask from the metapage; however, it's undesirable
+  for performance reasons to have to lock and pin the metapage for
+  every such operation.  Instead, we retain a cached copy of the metapage
+  in each backend's relcache entry.  This will produce the correct
+  bucket mapping as long as the target bucket hasn't been split since the
+  last cache refresh.
+ </para>
+
+ <para>
+  Primary bucket pages and overflow pages are allocated independently since
+  any given index might need more or fewer overflow pages relative to its
+  number of buckets.  The hash code uses an interesting set of addressing
+  rules to support a variable number of overflow pages while not having to
+  move primary bucket pages around after they are created.
+ </para>
+
+ <para>
+  Each row in the table indexed is represented by a single index tuple in
+  the hash index. Hash index tuples are stored in the bucket pages, and if
+  they exist, the overflow pages. 
+  We speed up searches by keeping the index entries in any 
+  one index page sorted by hash code, thus allowing binary search to be used
+  within an index page.  Note however that there is *no* assumption about the
+  relative ordering of hash codes across different index pages of a bucket.
+ </para>
+
+ <para>
+  The bucket splitting algorithms to expand the hash index are too complex to
+  be worthy of mention here, though are described in more detail in
+  src/backend/access/hash/README.
+  The split algorithm is crash safe and can be restarted if not completed
+  successfully.
+ </para>
+
+</sect1>
+
+</chapter>
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index d453be3909..dba9cf413f 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -266,6 +266,7 @@ break is not needed in a wider output rendering.
   &spgist;
   &gin;
   &brin;
+  &hash;
   &storage;
   &bki;
   &planstats;
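
As a minimal illustration of the chapter's main points, a sketch with hypothetical table and index names (hash indexes are created with USING hash, are single-column and non-unique, and help only equality predicates):

CREATE TABLE links (id bigint, url text);

-- single-column, non-unique hash index on a potentially long value
CREATE INDEX links_url_hash ON links USING hash (url);

-- an equality predicate can use the index ...
SELECT id FROM links WHERE url = 'https://www.postgresql.org/';

-- ... but a range predicate cannot, since only = is supported
SELECT id FROM links WHERE url > 'https://';
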
#2 Justin Pryzby
pryzby@telsasoft.com
In reply to: Simon Riggs (#1)
Re: Doc chapter for Hash Indexes

On Mon, Jun 21, 2021 at 02:08:12PM +0100, Simon Riggs wrote:

New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

+ <para>
+  PostgreSQL includes an implementation of persistent on-disk hash indexes,
+  which are now fully crash recoverable. Any data type can be indexed by a

I don't see any need to mention that they're "now" crash safe.

+  Each hash index tuple stores just the 4-byte hash value, not the actual
+  column value. As a result, hash indexes may be much smaller than B-trees
+  when indexing longer data items such as UUIDs, URLs etc.. The absence of

comma:
URLs, etc.

+  the column value also makes all hash index scans lossy. Hash indexes may
+  take part in bitmap index scans and backward scans.

Isn't it more correct to say that it must use a bitmap scan?

+  through the tree until the leaf page is found. In tables with millions
+  of rows this descent can increase access time to data. The equivalent

rows comma

+ that hash value. When scanning a hash bucket during queries we need to

queries comma

+ <para>
+  As a result of the overflow cases, we can say that hash indexes are
+  most suitable for unique, nearly unique data or data with a low number
+  of rows per hash bucket will be suitable for hash indexes. One

The beginning and end of the sentence duplicate "suitable".

+  Each row in the table indexed is represented by a single index tuple in
+  the hash index. Hash index tuples are stored in the bucket pages, and if
+  they exist, the overflow pages. 

"the overflow pages" didn't sound right, but I was confused by the comma.
I think it should say ".. in bucket pages and overflow pages, if any."

--
Justin
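
A sketch of the partial-index workaround the patch mentions, with hypothetical names: a value that occurs in a large fraction of rows is kept out of the index, so it cannot bloat a single bucket's overflow chain:

CREATE TABLE readings (sensor_id integer, raw text);

-- 'UNKNOWN' is highly non-unique; exclude it with a partial index condition
CREATE INDEX readings_raw_hash ON readings USING hash (raw)
  WHERE raw <> 'UNKNOWN';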

#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Justin Pryzby (#2)
Re: Doc chapter for Hash Indexes

On Tue, Jun 22, 2021 at 4:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Jun 21, 2021 at 02:08:12PM +0100, Simon Riggs wrote:

New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

..

+  the column value also makes all hash index scans lossy. Hash indexes may
+  take part in bitmap index scans and backward scans.

Isn't it more correct to say that it must use a bitmap scan?

Why? Hash indexes do support regular index scan.
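
For example (a sketch; the table and index names are hypothetical), the planner can choose either plan shape for an equality predicate, depending on cost estimates:

CREATE TABLE t (k integer, v text);
CREATE INDEX t_k_hash ON t USING hash (k);

EXPLAIN SELECT * FROM t WHERE k = 42;
-- may show "Index Scan using t_k_hash" (a regular index scan), or a
-- Bitmap Index Scan / Bitmap Heap Scan pair, depending on costs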

--
With Regards,
Amit Kapila.

#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#1)
Re: Doc chapter for Hash Indexes

On Mon, Jun 21, 2021 at 6:38 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

Few comments
==============
1.
+  Hash indexes are best optimized for SELECTs and UPDATEs using equality
+  scans on larger tables.

Is there a reason to mention Selects and Updates but not Deletes?

2.
+  Like B-Trees, hash indexes perform simple index tuple deletion. This
+  is a deferred maintenance operation that deletes index tuples that are
+  known to be safe to delete (those whose item identifier's LP_DEAD bit
+  is already set). This is performed speculatively upon each insert,
+  though may not succeed if the page is pinned by another backend.

It is not very clear to me when we perform the simple index tuple
deletion from the above sentence. We perform it when there is no space
to accommodate a new tuple on the bucket page and as a result, we
might need to create an overflow page. Basically, I am not sure
saying: "This is performed speculatively upon each insert .." is
helpful.

3.
+  incrementally expanded.  When a new bucket is to be added to the index,
+  exactly one existing bucket will need to be "split", with some of its
+  tuples being transferred to the new bucket according to the updated
+  key-to-bucket-number mapping.  This is essentially the same hash table

In most places, the patch has used a single space after the full stop
but at some places like above, it has used two spaces after full stop.
I think it is better to be consistent.

4.
 This is essentially the same hash table
+  management technique embodied in src/backend/utils/hash/dynahash.c for
+  in-memory hash tables used within PostgreSQL internals.

I am not sure if there is a need to mention this in the user-facing
doc. I think README is a better place for this.

--
With Regards,
Amit Kapila.

#5 Simon Riggs
simon.riggs@enterprisedb.com
In reply to: Amit Kapila (#4)
2 attachment(s)
Re: Doc chapter for Hash Indexes

On Tue, Jun 22, 2021 at 7:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 6:38 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

Few comments
==============
1.
+  Hash indexes are best optimized for SELECTs and UPDATEs using equality
+  scans on larger tables.

Is there a reason to mention Selects and Updates but not Deletes?

Deletes decrease the number of rows, so must eventually be matched with inserts.
So deletes imply inserts.

Perhaps it should say "update-heavy"

2.
+  Like B-Trees, hash indexes perform simple index tuple deletion. This
+  is a deferred maintenance operation that deletes index tuples that are
+  known to be safe to delete (those whose item identifier's LP_DEAD bit
+  is already set). This is performed speculatively upon each insert,
+  though may not succeed if the page is pinned by another backend.

It is not very clear to me when we perform the simple index tuple
deletion from the above sentence. We perform it when there is no space
to accommodate a new tuple on the bucket page and as a result, we
might need to create an overflow page. Basically, I am not sure
saying: "This is performed speculatively upon each insert .." is
helpful.

OK, thanks, will reword.

3.
+  incrementally expanded.  When a new bucket is to be added to the index,
+  exactly one existing bucket will need to be "split", with some of its
+  tuples being transferred to the new bucket according to the updated
+  key-to-bucket-number mapping.  This is essentially the same hash table

In most places, the patch has used a single space after the full stop
but at some places like above, it has used two spaces after full stop.
I think it is better to be consistent.

OK

4.
This is essentially the same hash table
+  management technique embodied in src/backend/utils/hash/dynahash.c for
+  in-memory hash tables used within PostgreSQL internals.

I am not sure if there is a need to mention this in the user-facing
doc. I think README is a better place for this.

OK, will remove. Thanks

I've reworded most things from both Amit and Justin; thanks for your reviews.

I attach both clean and compare versions.

--
Simon Riggs http://www.EnterpriseDB.com/

Attachments:

doc_hash_index.v1-v2.diff (application/octet-stream)
diff --git a/doc/src/sgml/hash.sgml b/doc/src/sgml/hash.sgml
index 6830b384bf..b4545a92ef 100644
--- a/doc/src/sgml/hash.sgml
+++ b/doc/src/sgml/hash.sgml
@@ -12,8 +12,9 @@
  <title>Overview</title>
 
  <para>
-  PostgreSQL includes an implementation of persistent on-disk hash indexes,
-  which are now fully crash recoverable. Any data type can be indexed by a
+  <productname>PostgreSQL</productname>
+  includes an implementation of persistent on-disk hash indexes,
+  which are fully crash recoverable. Any data type can be indexed by a
   hash index, including data types that do not have a well-defined linear
   ordering. Hash indexes store only the hash value of the data being
   indexed, thus there are no restrictions on the size of the data column
@@ -26,23 +27,25 @@
  </para>
 
  <para>
-  Hash indexes support only the = operator, so WHERE clauses that specify
+  Hash indexes support only the <literal>=</literal> operator,
+  so WHERE clauses that specify
   range operations will not be able to take advantage of hash indexes.
  </para>
 
  <para>
   Each hash index tuple stores just the 4-byte hash value, not the actual
   column value. As a result, hash indexes may be much smaller than B-trees
-  when indexing longer data items such as UUIDs, URLs etc.. The absence of
+  when indexing longer data items such as UUIDs, URLs, etc.. The absence of
   the column value also makes all hash index scans lossy. Hash indexes may
   take part in bitmap index scans and backward scans.
  </para>
 
  <para>
-  Hash indexes are best optimized for SELECTs and UPDATEs using equality
-  scans on larger tables. In a B-tree index, searches must descend
+  Hash indexes are best optimized for SELECT and UPDATE-heavy workloads
+  that use equality scans on larger tables.
+  In a B-tree index, searches must descend
   through the tree until the leaf page is found. In tables with millions
-  of rows this descent can increase access time to data. The equivalent
+  of rows, this descent can increase access time to data. The equivalent
   of a leaf page in a hash index is referred to as a bucket page. In
   contrast, a hash index allows accessing the bucket pages directly,
   thereby potentially reducing index access time in larger tables. This
@@ -56,7 +59,7 @@
   values are evenly distributed. When inserts mean that the bucket page
   becomes full, additional overflow pages are chained to that specific
   bucket page, locally expanding the storage for index tuples that match
-  that hash value. When scanning a hash bucket during queries we need to
+  that hash value. When scanning a hash bucket during queries, we need to
   scan through all of the overflow pages. Thus an unbalanced hash index
   might actually be worse than a B-tree in terms of number of block
   accesses required, for some data.
@@ -65,18 +68,19 @@
  <para>
   As a result of the overflow cases, we can say that hash indexes are
   most suitable for unique, nearly unique data or data with a low number
-  of rows per hash bucket will be suitable for hash indexes. One
-  possible way to avoid problems is to exclude highly non-unique values
-  from the index using a partial index condition, but this may not be
-  suitable in many cases.
+  of rows per hash bucket.
+  One possible way to avoid problems is to exclude highly non-unique
+  values from the index using a partial index condition, but this may
+  not be suitable in many cases.
  </para>
 
  <para>
   Like B-Trees, hash indexes perform simple index tuple deletion. This
   is a deferred maintenance operation that deletes index tuples that are
   known to be safe to delete (those whose item identifier's LP_DEAD bit
-  is already set). This is performed speculatively upon each insert,
-  though may not succeed if the page is pinned by another backend.
+  is already set). If an insert finds no space is available on a page we
+  try to avoid creating a new overflow page by attempting to remove dead
+  index tuples. Removal cannot occur if the page is pinned at that time.
   Deletion of dead index pointers also occurs during VACUUM.  If an
   overflow page becomes empty, overflow pages can be recycled for reuse
   in other buckets, though we never return them to the operating system.
@@ -95,9 +99,7 @@
   incrementally expanded.  When a new bucket is to be added to the index,
   exactly one existing bucket will need to be "split", with some of its
   tuples being transferred to the new bucket according to the updated
-  key-to-bucket-number mapping.  This is essentially the same hash table
-  management technique embodied in src/backend/utils/hash/dynahash.c for
-  in-memory hash tables used within PostgreSQL internals.
+  key-to-bucket-number mapping.
  </para>
 
  <para>
@@ -140,8 +142,8 @@
 
  <para>
   Each row in the table indexed is represented by a single index tuple in
-  the hash index. Hash index tuples are stored in the bucket pages, and if
-  they exist, the overflow pages. 
+  the hash index. Hash index tuples are stored in bucket pages, and if
+  they exist, overflow pages. 
   We speed up searches by keeping the index entries in any 
   one index page sorted by hash code, thus allowing binary search to be used
   within an index page.  Note however that there is *no* assumption about the
@@ -151,7 +153,7 @@
  <para>
   The bucket splitting algorithms to expand the hash index are too complex to
   be worthy of mention here, though are described in more detail in
-  src/backend/access/hash/README.
+  <filename>src/backend/access/hash/README</filename>.
   The split algorithm is crash safe and can be restarted if not completed
   successfully.
  </para>
doc_hash_index.v2.patch (application/octet-stream)
commit fe57286ca9596ccea4566e9cff7d6e1d9b17c8cc
Author: Simon Riggs <simon.riggs@enterprisedb.com>
Date:   Tue Jun 22 09:54:16 2021 +0100

    New doc chapter for Hash Indexes

diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 45b701426b..459add557b 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -89,6 +89,7 @@
 <!ENTITY spgist     SYSTEM "spgist.sgml">
 <!ENTITY gin        SYSTEM "gin.sgml">
 <!ENTITY brin       SYSTEM "brin.sgml">
+<!ENTITY hash       SYSTEM "hash.sgml">
 <!ENTITY planstats    SYSTEM "planstats.sgml">
 <!ENTITY tableam    SYSTEM "tableam.sgml">
 <!ENTITY indexam    SYSTEM "indexam.sgml">
diff --git a/doc/src/sgml/hash.sgml b/doc/src/sgml/hash.sgml
new file mode 100644
index 0000000000..b4545a92ef
--- /dev/null
+++ b/doc/src/sgml/hash.sgml
@@ -0,0 +1,163 @@
+<!-- doc/src/sgml/hash.sgml -->
+
+<chapter id="hash-index">
+<title>Hash Indexes</title>
+
+   <indexterm>
+    <primary>index</primary>
+    <secondary>Hash</secondary>
+   </indexterm>
+
+<sect1 id="hash-intro">
+ <title>Overview</title>
+
+ <para>
+  <productname>PostgreSQL</productname>
+  includes an implementation of persistent on-disk hash indexes,
+  which are fully crash recoverable. Any data type can be indexed by a
+  hash index, including data types that do not have a well-defined linear
+  ordering. Hash indexes store only the hash value of the data being
+  indexed, thus there are no restrictions on the size of the data column
+  being indexed.
+ </para>
+
+ <para>
+  Hash indexes support only single-column indexes and do not allow
+  uniqueness checking.
+ </para>
+
+ <para>
+  Hash indexes support only the <literal>=</literal> operator,
+  so WHERE clauses that specify
+  range operations will not be able to take advantage of hash indexes.
+ </para>
+
+ <para>
+  Each hash index tuple stores just the 4-byte hash value, not the actual
+  column value. As a result, hash indexes may be much smaller than B-trees
+  when indexing longer data items such as UUIDs, URLs, etc.. The absence of
+  the column value also makes all hash index scans lossy. Hash indexes may
+  take part in bitmap index scans and backward scans.
+ </para>
+
+ <para>
+  Hash indexes are best optimized for SELECT and UPDATE-heavy workloads
+  that use equality scans on larger tables.
+  In a B-tree index, searches must descend
+  through the tree until the leaf page is found. In tables with millions
+  of rows, this descent can increase access time to data. The equivalent
+  of a leaf page in a hash index is referred to as a bucket page. In
+  contrast, a hash index allows accessing the bucket pages directly,
+  thereby potentially reducing index access time in larger tables. This
+  reduction in "logical I/O" becomes even more pronounced on indexes/data
+  larger than shared_buffers/RAM. 
+ </para>
+
+ <para>
+  Hash indexes have been designed to cope with uneven distributions of
+  hash values. Direct access to the bucket pages works well if the hash
+  values are evenly distributed. When inserts mean that the bucket page
+  becomes full, additional overflow pages are chained to that specific
+  bucket page, locally expanding the storage for index tuples that match
+  that hash value. When scanning a hash bucket during queries, we need to
+  scan through all of the overflow pages. Thus an unbalanced hash index
+  might actually be worse than a B-tree in terms of number of block
+  accesses required, for some data.
+ </para>
+
+ <para>
+  As a result of the overflow cases, we can say that hash indexes are
+  most suitable for unique, nearly unique data or data with a low number
+  of rows per hash bucket.
+  One possible way to avoid problems is to exclude highly non-unique
+  values from the index using a partial index condition, but this may
+  not be suitable in many cases.
+ </para>
+
+ <para>
+  Like B-Trees, hash indexes perform simple index tuple deletion. This
+  is a deferred maintenance operation that deletes index tuples that are
+  known to be safe to delete (those whose item identifier's LP_DEAD bit
+  is already set). If an insert finds no space is available on a page we
+  try to avoid creating a new overflow page by attempting to remove dead
+  index tuples. Removal cannot occur if the page is pinned at that time.
+  Deletion of dead index pointers also occurs during VACUUM.  If an
+  overflow page becomes empty, overflow pages can be recycled for reuse
+  in other buckets, though we never return them to the operating system.
+ </para>
+
+ <para>
+  There is currently no provision to shrink a hash index, other than by
+  rebuilding it with REINDEX.  
+  There is no provision for reducing the number of buckets, either.
+ </para>
+
+ <para>
+  Hash indexes may expand the number of bucket pages as the number of
+  rows indexed grows.
+  The hash key-to-bucket-number mapping is chosen so that the index can be
+  incrementally expanded.  When a new bucket is to be added to the index,
+  exactly one existing bucket will need to be "split", with some of its
+  tuples being transferred to the new bucket according to the updated
+  key-to-bucket-number mapping.
+ </para>
+
+ <para>
+  The expansion occurs in the foreground, which could increase execution
+  time for user inserts. Thus, hash indexes may not be suitable for tables
+  with a rapidly increasing number of rows.
+ </para>
+
+</sect1>
+
+<sect1 id="hash-implementation">
+ <title>Implementation</title>
+
+ <para>
+  There are four kinds of pages in a hash index: the meta page (page zero),
+  which contains statically allocated control information; primary bucket
+  pages; overflow pages; and bitmap pages, which keep track of overflow
+  pages that have been freed and are available for re-use.  For addressing
+  purposes, bitmap pages are regarded as a subset of the overflow pages.
+ </para>
+
+ <para>
+  Both scanning the index and inserting tuples require locating the bucket
+  where a given tuple ought to be located.  To do this, we need the bucket
+  count, highmask, and lowmask from the metapage; however, it's undesirable
+  for performance reasons to have to lock and pin the metapage for
+  every such operation.  Instead, we retain a cached copy of the metapage
+  in each backend's relcache entry.  This will produce the correct
+  bucket mapping as long as the target bucket hasn't been split since the
+  last cache refresh.
+ </para>
+
+ <para>
+  Primary bucket pages and overflow pages are allocated independently since
+  any given index might need more or fewer overflow pages relative to its
+  number of buckets.  The hash code uses an interesting set of addressing
+  rules to support a variable number of overflow pages while not having to
+  move primary bucket pages around after they are created.
+ </para>
+
+ <para>
+  Each row in the table indexed is represented by a single index tuple in
+  the hash index. Hash index tuples are stored in bucket pages, and if
+  they exist, overflow pages. 
+  We speed up searches by keeping the index entries in any 
+  one index page sorted by hash code, thus allowing binary search to be used
+  within an index page.  Note however that there is *no* assumption about the
+  relative ordering of hash codes across different index pages of a bucket.
+ </para>
+
+ <para>
+  The bucket splitting algorithms to expand the hash index are too complex to
+  be worthy of mention here, though are described in more detail in
+  <filename>src/backend/access/hash/README</filename>.
+  The split algorithm is crash safe and can be restarted if not completed
+  successfully.
+ </para>
+
+</sect1>
+
+</chapter>
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index d453be3909..dba9cf413f 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -266,6 +266,7 @@ break is not needed in a wider output rendering.
   &spgist;
   &gin;
   &brin;
+  &hash;
   &storage;
   &bki;
   &planstats;
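
To see the size effect described in the patch (hash index tuples store only the 4-byte hash value), a sketch with hypothetical names; actual numbers depend on the data loaded:

CREATE TABLE events (id uuid, payload text);
-- after loading a reasonable volume of rows:
CREATE INDEX events_id_btree ON events USING btree (id);
CREATE INDEX events_id_hash  ON events USING hash (id);

SELECT pg_size_pretty(pg_relation_size('events_id_btree')) AS btree_size,
       pg_size_pretty(pg_relation_size('events_id_hash'))  AS hash_size;
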
#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#5)
Re: Doc chapter for Hash Indexes

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

--
With Regards,
Amit Kapila.

#7 Simon Riggs
simon.riggs@enterprisedb.com
In reply to: Amit Kapila (#6)
Re: Doc chapter for Hash Indexes

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

--
Simon Riggs http://www.EnterpriseDB.com/

#8 Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#7)
Re: Doc chapter for Hash Indexes

On Wed, Jun 23, 2021 at 12:56:51PM +0100, Simon Riggs wrote:

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

Yeah, I think backpatching makes sense.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#8)
Re: Doc chapter for Hash Indexes

On Fri, Jun 25, 2021 at 1:29 AM Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Jun 23, 2021 at 12:56:51PM +0100, Simon Riggs wrote:

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

Yeah, I think backpatching makes sense.

I checked and found that there are two commits (7c75ef5715 and
22c5e73562) in the hash index code in PG-11 which might have impacted
what we write in the documentation. However, AFAICS, nothing proposed
in the patch would change due to those commits. Even if we don't want
to back patch, is there any harm in committing this to PG-14?

--
With Regards,
Amit Kapila.

#10 Simon Riggs
simon.riggs@enterprisedb.com
In reply to: Amit Kapila (#9)
2 attachment(s)
Re: Doc chapter for Hash Indexes

On Fri, Jun 25, 2021 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 25, 2021 at 1:29 AM Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Jun 23, 2021 at 12:56:51PM +0100, Simon Riggs wrote:

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

Yeah, I think backpatching makes sense.

I checked and found that there are two commits (7c75ef5715 and
22c5e73562) in the hash index code in PG-11 which might have impacted
what we write in the documentation. However, AFAICS, nothing proposed
in the patch would change due to those commits. Even if we don't want
to back patch, is there any harm in committing this to PG-14?

I've reviewed those commits and the related code, so I agree.

As a result, I've tweaked the wording around VACUUM slightly.

Clean and compare patches attached.

--
Simon Riggs http://www.EnterpriseDB.com/

Attachments:

doc_hash_index.v3.patch (application/octet-stream)
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 45b701426b..459add557b 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -89,6 +89,7 @@
 <!ENTITY spgist     SYSTEM "spgist.sgml">
 <!ENTITY gin        SYSTEM "gin.sgml">
 <!ENTITY brin       SYSTEM "brin.sgml">
+<!ENTITY hash       SYSTEM "hash.sgml">
 <!ENTITY planstats    SYSTEM "planstats.sgml">
 <!ENTITY tableam    SYSTEM "tableam.sgml">
 <!ENTITY indexam    SYSTEM "indexam.sgml">
diff --git a/doc/src/sgml/hash.sgml b/doc/src/sgml/hash.sgml
new file mode 100644
index 0000000000..617e10b3f2
--- /dev/null
+++ b/doc/src/sgml/hash.sgml
@@ -0,0 +1,165 @@
+<!-- doc/src/sgml/hash.sgml -->
+
+<chapter id="hash-index">
+<title>Hash Indexes</title>
+
+   <indexterm>
+    <primary>index</primary>
+    <secondary>Hash</secondary>
+   </indexterm>
+
+<sect1 id="hash-intro">
+ <title>Overview</title>
+
+ <para>
+  <productname>PostgreSQL</productname>
+  includes an implementation of persistent on-disk hash indexes,
+  which are fully crash recoverable. Any data type can be indexed by a
+  hash index, including data types that do not have a well-defined linear
+  ordering. Hash indexes store only the hash value of the data being
+  indexed, thus there are no restrictions on the size of the data column
+  being indexed.
+ </para>
+
+ <para>
+  Hash indexes support only single-column indexes and do not allow
+  uniqueness checking.
+ </para>
+
+ <para>
+  Hash indexes support only the <literal>=</literal> operator,
+  so WHERE clauses that specify
+  range operations will not be able to take advantage of hash indexes.
+ </para>
+
+ <para>
+  Each hash index tuple stores just the 4-byte hash value, not the actual
+  column value. As a result, hash indexes may be much smaller than B-trees
+  when indexing longer data items such as UUIDs, URLs, etc.. The absence of
+  the column value also makes all hash index scans lossy. Hash indexes may
+  take part in bitmap index scans and backward scans.
+ </para>
+
+ <para>
+  Hash indexes are best optimized for SELECT and UPDATE-heavy workloads
+  that use equality scans on larger tables.
+  In a B-tree index, searches must descend
+  through the tree until the leaf page is found. In tables with millions
+  of rows, this descent can increase access time to data. The equivalent
+  of a leaf page in a hash index is referred to as a bucket page. In
+  contrast, a hash index allows accessing the bucket pages directly,
+  thereby potentially reducing index access time in larger tables. This
+  reduction in "logical I/O" becomes even more pronounced on indexes/data
+  larger than shared_buffers/RAM. 
+ </para>
+
+ <para>
+  Hash indexes have been designed to cope with uneven distributions of
+  hash values. Direct access to the bucket pages works well if the hash
+  values are evenly distributed. When inserts mean that the bucket page
+  becomes full, additional overflow pages are chained to that specific
+  bucket page, locally expanding the storage for index tuples that match
+  that hash value. When scanning a hash bucket during queries, we need to
+  scan through all of the overflow pages. Thus an unbalanced hash index
+  might actually be worse than a B-tree in terms of number of block
+  accesses required, for some data.
+ </para>
+
+ <para>
+  As a result of the overflow cases, we can say that hash indexes are
+  most suitable for unique, nearly unique data or data with a low number
+  of rows per hash bucket.
+  One possible way to avoid problems is to exclude highly non-unique
+  values from the index using a partial index condition, but this may
+  not be suitable in many cases.
+ </para>
+
+ <para>
+  Like B-Trees, hash indexes perform simple index tuple deletion. This
+  is a deferred maintenance operation that deletes index tuples that are
+  known to be safe to delete (those whose item identifier's LP_DEAD bit
+  is already set). If an insert finds no space is available on a page we
+  try to avoid creating a new overflow page by attempting to remove dead
+  index tuples. Removal cannot occur if the page is pinned at that time.
+  Deletion of dead index pointers also occurs during VACUUM.
+ </para>
+
+ <para>
+  If it can, VACUUM will also try to squeeze the index tuples onto as
+  few overflow pages as possible, minimizing the overflow chain. If an
+  overflow page becomes empty, overflow pages can be recycled for reuse
+  in other buckets, though we never return them to the operating system.
+  There is currently no provision to shrink a hash index, other than by
+  rebuilding it with REINDEX.  
+  There is no provision for reducing the number of buckets, either.
+ </para>
+
+ <para>
+  Hash indexes may expand the number of bucket pages as the number of
+  rows indexed grows.
+  The hash key-to-bucket-number mapping is chosen so that the index can be
+  incrementally expanded.  When a new bucket is to be added to the index,
+  exactly one existing bucket will need to be "split", with some of its
+  tuples being transferred to the new bucket according to the updated
+  key-to-bucket-number mapping.
+ </para>
+
+ <para>
+  The expansion occurs in the foreground, which could increase execution
+  time for user inserts. Thus, hash indexes may not be suitable for tables
+  with a rapidly increasing number of rows.
+ </para>
+
+</sect1>
+
+<sect1 id="hash-implementation">
+ <title>Implementation</title>
+
+ <para>
+  There are four kinds of pages in a hash index: the meta page (page zero),
+  which contains statically allocated control information; primary bucket
+  pages; overflow pages; and bitmap pages, which keep track of overflow
+  pages that have been freed and are available for re-use.  For addressing
+  purposes, bitmap pages are regarded as a subset of the overflow pages.
+ </para>
+
+ <para>
+  Both scanning the index and inserting tuples require locating the bucket
+  where a given tuple ought to be located.  To do this, we need the bucket
+  count, highmask, and lowmask from the metapage; however, it's undesirable
+  for performance reasons to have to lock and pin the metapage for
+  every such operation.  Instead, we retain a cached copy of the metapage
+  in each backend's relcache entry.  This will produce the correct
+  bucket mapping as long as the target bucket hasn't been split since the
+  last cache refresh.
+ </para>
+
+ <para>
+  Primary bucket pages and overflow pages are allocated independently since
+  any given index might need more or fewer overflow pages relative to its
+  number of buckets.  The hash code uses an interesting set of addressing
+  rules to support a variable number of overflow pages while not having to
+  move primary bucket pages around after they are created.
+ </para>
+
+ <para>
+  Each row in the table indexed is represented by a single index tuple in
+  the hash index. Hash index tuples are stored in bucket pages, and if
+  they exist, overflow pages. 
+  We speed up searches by keeping the index entries in any 
+  one index page sorted by hash code, thus allowing binary search to be used
+  within an index page.  Note however that there is *no* assumption about the
+  relative ordering of hash codes across different index pages of a bucket.
+ </para>
+
+ <para>
+  The bucket splitting algorithms to expand the hash index are too complex to
+  be worthy of mention here, though are described in more detail in
+  <filename>src/backend/access/hash/README</filename>.
+  The split algorithm is crash safe and can be restarted if not completed
+  successfully.
+ </para>
+
+</sect1>
+
+</chapter>
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index d453be3909..dba9cf413f 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -266,6 +266,7 @@ break is not needed in a wider output rendering.
   &spgist;
   &gin;
   &brin;
+  &hash;
   &storage;
   &bki;
   &planstats;
doc_hash_index.v2-v3.patch (application/octet-stream)
diff --git a/doc/src/sgml/hash.sgml b/doc/src/sgml/hash.sgml
index b4545a92ef..617e10b3f2 100644
--- a/doc/src/sgml/hash.sgml
+++ b/doc/src/sgml/hash.sgml
@@ -81,12 +81,14 @@
   is already set). If an insert finds no space is available on a page we
   try to avoid creating a new overflow page by attempting to remove dead
   index tuples. Removal cannot occur if the page is pinned at that time.
-  Deletion of dead index pointers also occurs during VACUUM.  If an
-  overflow page becomes empty, overflow pages can be recycled for reuse
-  in other buckets, though we never return them to the operating system.
+  Deletion of dead index pointers also occurs during VACUUM.
  </para>
 
  <para>
+  If it can, VACUUM will also try to squeeze the index tuples onto as
+  few overflow pages as possible, minimizing the overflow chain. If an
+  overflow page becomes empty, overflow pages can be recycled for reuse
+  in other buckets, though we never return them to the operating system.
   There is currently no provision to shrink a hash index, other than by
   rebuilding it with REINDEX.  
   There is no provision for reducing the number of buckets, either.
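
The bucket count, overflow pages, and the effect of VACUUM described in the patch can be observed directly with the pageinspect extension (a sketch; the index name is hypothetical, and pageinspect must be installed):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- metapage (block 0): bucket count, overflow point, free-map info, etc.
SELECT * FROM hash_metapage_info(get_raw_page('links_url_hash', 0));

-- classify any page as metapage, bucket, overflow or bitmap
SELECT hash_page_type(get_raw_page('links_url_hash', 1));

-- per-page tuple counts, including dead items awaiting cleanup
SELECT * FROM hash_page_stats(get_raw_page('links_url_hash', 1));
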
#11 Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#10)
Re: Doc chapter for Hash Indexes

On Fri, Jun 25, 2021 at 3:11 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

On Fri, Jun 25, 2021 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 25, 2021 at 1:29 AM Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Jun 23, 2021 at 12:56:51PM +0100, Simon Riggs wrote:

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

Yeah, I think backpatching makes sense.

I checked and found that there are two commits (7c75ef5715 and
22c5e73562) in the hash index code in PG-11 which might have impacted
what we write in the documentation. However, AFAICS, nothing proposed
in the patch would change due to those commits. Even if we don't want
to back patch, is there any harm in committing this to PG-14?

I've reviewed those commits and the related code, so I agree.

Do you agree to just commit this to PG-14 or to commit in PG-14 and
backpatch till PG-10?

As a result, I've tweaked the wording around VACUUM slightly.

Thanks, the changes look good to me.

--
With Regards,
Amit Kapila.

#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#11)
Re: Doc chapter for Hash Indexes

On Sat, Jun 26, 2021 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 25, 2021 at 3:11 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

On Fri, Jun 25, 2021 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 25, 2021 at 1:29 AM Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Jun 23, 2021 at 12:56:51PM +0100, Simon Riggs wrote:

On Wed, Jun 23, 2021 at 5:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 22, 2021 at 2:31 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

I attach both clean and compare versions.

Do we want to hold this work for PG15 or commit in PG14 and backpatch
it till v10 where we have made hash indexes crash-safe? I would vote
for committing in PG14 and backpatch it till v10, however, I am fine
if we want to commit just to PG14 or PG15.

Backpatch makes sense to me, but since not everyone will be reading
this thread, I would look towards PG15 only first. We may yet pick up
additional corrections or additions before a backpatch, if that is
agreed.

Yeah, I think backpatching makes sense.

I checked and found that there are two commits (7c75ef5715 and
22c5e73562) in the hash index code in PG-11 which might have impacted
what we write in the documentation. However, AFAICS, nothing proposed
in the patch would change due to those commits. Even if we don't want
to back patch, is there any harm in committing this to PG-14?

I've reviewed those commits and the related code, so I agree.

Do you agree to just commit this to PG-14 or to commit in PG-14 and
backpatch till PG-10?

I am planning to go through the patch once again and would like to
commit and backpatch till v10 in a day or two unless someone thinks
otherwise.

--
With Regards,
Amit Kapila.

#13 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#12)
Re: Doc chapter for Hash Indexes

On Tue, Jun 29, 2021 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Jun 26, 2021 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I am planning to go through the patch once again and would like to
commit and backpatch till v10 in a day or two unless someone thinks
otherwise.

Pushed.

--
With Regards,
Amit Kapila.