commit 487fbecb3411a9eb0559af856cc9779f242cdc3e Author: Simon Riggs Date: Mon Jun 21 13:57:27 2021 +0100 New doc chapter for Hash Indexes diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml index 45b701426b..459add557b 100644 --- a/doc/src/sgml/filelist.sgml +++ b/doc/src/sgml/filelist.sgml @@ -89,6 +89,7 @@ + diff --git a/doc/src/sgml/hash.sgml b/doc/src/sgml/hash.sgml new file mode 100644 index 0000000000..6830b384bf --- /dev/null +++ b/doc/src/sgml/hash.sgml @@ -0,0 +1,161 @@ + + + +Hash Indexes + + + index + Hash + + + + Overview + + + PostgreSQL includes an implementation of persistent on-disk hash indexes, + which are now fully crash recoverable. Any data type can be indexed by a + hash index, including data types that do not have a well-defined linear + ordering. Hash indexes store only the hash value of the data being + indexed, thus there are no restrictions on the size of the data column + being indexed. + + + + Hash indexes support only single-column indexes and do not allow + uniqueness checking. + + + + Hash indexes support only the = operator, so WHERE clauses that specify + range operations will not be able to take advantage of hash indexes. + + + + Each hash index tuple stores just the 4-byte hash value, not the actual + column value. As a result, hash indexes may be much smaller than B-trees + when indexing longer data items such as UUIDs or URLs. The absence of + the column value also makes all hash index scans lossy. Hash indexes may + take part in bitmap index scans and backward scans. + + + + Hash indexes are best suited to SELECT and UPDATE-heavy workloads that + use equality scans on larger tables. In a B-tree index, searches must descend + through the tree until the leaf page is found. In tables with millions + of rows this descent can increase access time to data. The equivalent + of a leaf page in a hash index is referred to as a bucket page.
In + contrast, a hash index allows accessing the bucket pages directly, + thereby potentially reducing index access time in larger tables. This + reduction in "logical I/O" becomes even more pronounced on indexes/data + larger than shared_buffers/RAM. + + + + Hash indexes have been designed to cope with uneven distributions of + hash values. Direct access to the bucket pages works well if the hash + values are evenly distributed. If a bucket page fills up as a result of + inserts, additional overflow pages are chained to that specific + bucket page, locally expanding the storage for index tuples that match + that hash value. When scanning a hash bucket during queries, we need to + scan through all of the overflow pages. Thus an unbalanced hash index + might actually be worse than a B-tree in terms of the number of block + accesses required, for some data. + + + + As a result of the overflow cases, hash indexes are most suitable for + unique data, nearly unique data, or data with a low number + of rows per hash bucket. One + possible way to avoid problems is to exclude highly non-unique values + from the index using a partial index condition, but this may not be + suitable in many cases. + + + + Like B-Trees, hash indexes perform simple index tuple deletion. This + is a deferred maintenance operation that deletes index tuples that are + known to be safe to delete (those whose item identifier's LP_DEAD bit + is already set). This is performed speculatively upon each insert, + though it may not succeed if the page is pinned by another backend. + Deletion of dead index pointers also occurs during VACUUM. If an + overflow page becomes empty, it can be recycled for reuse + in other buckets, though overflow pages are never returned to the operating + system. + + + + There is currently no provision to shrink a hash index, other than by + rebuilding it with REINDEX. + There is no provision for reducing the number of buckets, either.
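The equality-only usage and the partial-index workaround described above can be sketched in SQL. This is an illustrative sketch, not part of the patch; the table and column names are hypothetical:

```sql
-- A plain hash index: usable only for equality lookups (WHERE url = ...),
-- never for range predicates such as url > ...
CREATE INDEX links_url_hash ON links USING hash (url);

-- Hypothetical skewed column: exclude a highly non-unique value with a
-- partial index condition, so that one bucket does not accumulate a long
-- chain of overflow pages
CREATE INDEX events_kind_hash ON events USING hash (kind)
    WHERE kind <> 'heartbeat';
```

Note that the planner will consider the partial index only for queries whose WHERE clause implies the index's condition (here, `kind <> 'heartbeat'`).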
+ + + + Hash indexes may expand the number of bucket pages as the number of + rows indexed grows. + The hash key-to-bucket-number mapping is chosen so that the index can be + incrementally expanded. When a new bucket is to be added to the index, + exactly one existing bucket will need to be "split", with some of its + tuples being transferred to the new bucket according to the updated + key-to-bucket-number mapping. This is essentially the same hash table + management technique embodied in src/backend/utils/hash/dynahash.c for + in-memory hash tables used within PostgreSQL internals. + + + + The expansion occurs in the foreground, which could increase execution + time for user inserts. Thus, hash indexes may not be suitable for tables + with a rapidly increasing number of rows. + + + + + + Implementation + + + There are four kinds of pages in a hash index: the meta page (page zero), + which contains statically allocated control information; primary bucket + pages; overflow pages; and bitmap pages, which keep track of overflow + pages that have been freed and are available for re-use. For addressing + purposes, bitmap pages are regarded as a subset of the overflow pages. + + + + Both scanning the index and inserting tuples require locating the bucket + where a given tuple ought to be located. To do this, we need the bucket + count, highmask, and lowmask from the metapage; however, it's undesirable + for performance reasons to have to lock and pin the metapage for + every such operation. Instead, we retain a cached copy of the metapage + in each backend's relcache entry. This will produce the correct + bucket mapping as long as the target bucket hasn't been split since the + last cache refresh. + + + + Primary bucket pages and overflow pages are allocated independently since + any given index might need more or fewer overflow pages relative to its + number of buckets.
The hash code uses an interesting set of addressing + rules to support a variable number of overflow pages while not having to + move primary bucket pages around after they are created. + + + + Each row in the indexed table is represented by a single index tuple in + the hash index. Hash index tuples are stored in the bucket pages and, if + they exist, in the overflow pages. + We speed up searches by keeping the index entries in any + one index page sorted by hash code, thus allowing binary search to be used + within an index page. Note however that there is *no* assumption about the + relative ordering of hash codes across different index pages of a bucket. + + + + The bucket splitting algorithms used to expand the hash index are too + complex to describe here, though they are described in more detail in + src/backend/access/hash/README. + The split algorithm is crash safe and can be restarted if not completed + successfully. + + + + + diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml index d453be3909..dba9cf413f 100644 --- a/doc/src/sgml/postgres.sgml +++ b/doc/src/sgml/postgres.sgml @@ -266,6 +266,7 @@ break is not needed in a wider output rendering. &spgist; &gin; &brin; + &hash; &storage; &bki; &planstats;
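The bucket lookup described in the Implementation section, which uses the bucket count, highmask, and lowmask from the (possibly cached) metapage, can be sketched outside the backend. This Python rendering mirrors the mask-then-fall-back logic of the backend's bucket-mapping routine (`_hash_hashkey2bucket()` in src/backend/access/hash/hashutil.c); the function and variable names here are illustrative, not the C API:

```python
def hashkey2bucket(hash_value: int, maxbucket: int,
                   highmask: int, lowmask: int) -> int:
    """Map a 32-bit hash value to a bucket number.

    maxbucket is the highest bucket number currently in use.  highmask is
    the low-order bit mask for the next power of two above maxbucket, and
    lowmask is the next smaller mask.  A masked value that points past
    maxbucket falls back to the low-order mask, i.e. to the bucket that
    has not yet been split.
    """
    bucket = hash_value & highmask
    if bucket > maxbucket:
        bucket = bucket & lowmask
    return bucket

# With 6 buckets (numbers 0..5): highmask = 7, lowmask = 3.
# Masked values 6 and 7 exceed maxbucket, so those keys stay in
# buckets 2 and 3, which have not been split yet.
print(hashkey2bucket(6, maxbucket=5, highmask=7, lowmask=3))   # 2
print(hashkey2bucket(13, maxbucket=5, highmask=7, lowmask=3))  # 5
```

Because the mapping depends only on these three values, a backend can resolve buckets from its cached metapage copy, refreshing the cache only when it discovers the target bucket has since been split.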