From 2c7dd20a47f539f2de514706a32b02207a04e5a6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@heroku.com>
Date: Wed, 27 Aug 2014 15:16:11 -0700
Subject: [PATCH 4/4] Internal documentation for INSERT ... ON CONFLICT {UPDATE
 | IGNORE}

Includes documentation for both the nbtree and executor READMEs, as well
as the user-facing SGML documentation around index AMs.  The latter two
additions should make clear how the AM interface has been revised for
amcanunique AMs in the abstract, without reference to the only current
implementation, nbtree.
---
 doc/src/sgml/indexam.sgml        | 145 ++++++++++++++++++++++++++++++++++++---
 src/backend/access/nbtree/README |  75 ++++++++++++++++++++
 src/backend/executor/README      |  34 +++++++++
 3 files changed, 244 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 157047a..5cd1f14 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -176,13 +176,73 @@ ambuildempty (Relation indexRelation);
 
   <para>
 <programlisting>
+SpeculativeState *
+amlock (Relation  uniqueIndexRelation,
+        Datum    *values,
+        bool     *isnull,
+        Relation  heapRel,
+        List     *otherSpecStates,
+        bool      priorConflict);
+</programlisting>
+   Lock index values as part of the first phase of speculative
+   insertion.  If value locks are successfully acquired across all
+   unique indexes, the core system will attempt insertion proper.
+   Otherwise, it may lock a <quote>rejector row</> whose TID is
+   returned through the state return value.
+  </para>
+  <para>
+   The core system does not hold on to value locks acquired by AMs
+   for longer than necessary.  It holds index value locks only across
+   insertion of the corresponding heap row (perhaps including
+   toasting), across multiple <structfield>amcanunique</> lock
+   acquisitions, and across the latter phase of speculative
+   insertion.  AM authors should carefully analyze the deadlock
+   hazards, including interaction with system catalogs, which may be
+   innocuously queried with little provocation, and calls to
+   arbitrary user-defined functions, which occur with value locks
+   held during <function>FormIndexDatum</> evaluation for expression
+   indexes.  Conventionally, value locks are implemented using
+   heavyweight locks on pages, managed by the lock manager.  The core
+   code calls this function once per unique index, in the reverse of
+   the usual insertion order.  The <literal>priorConflict</> argument
+   hints to the implementation that there was recently a conflict on
+   the value proposed for insertion, which can optionally be used to
+   implement opportunistic lock strength reduction optimizations.
+  </para>
+
+  <para>
+   The function's state result value is used in a subsequent call to
+   <function>aminsert</> if and only if the core system resolves to
+   proceed with speculative insertion of the slot proposed for
+   insertion, which it will do if all value locks are successfully
+   acquired.  (Even that condition does not ensure that the system
+   will not need to restart value locking for all unique indexes from
+   the very beginning.  <structfield>amcanunique</> methods may only
+   assume that insertion will proceed when there is a corresponding
+   <function>aminsert</> call; they can never reasonably ascertain
+   that insertion will not proceed, since that is the core code's
+   concern.  The core code only releases index value locks through a
+   callback, however.)  The state value also provides callbacks
+   through which both the core code and all
+   <structfield>amcanunique</> access methods can release all value
+   locks.  The routine may need to release all value locks it holds
+   while it blocks pending the outcome of a conflicting transaction.
+   The AM may then report an unsuccessful attempt at value locking,
+   obliging the core code to restart value locking for all unique
+   indexes.  See <xref linkend="index-unique-checks"> for more
+   details.
+  </para>
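The two-phase protocol described above can be sketched as a toy C model. This is not the patch's actual API; the names (UniqueIndex, spec_lock_values, spec_try_insert) and the flag-based representation are illustrative assumptions.

```c
#include <stdbool.h>

/* Toy model of phased value locking; names are illustrative only. */
typedef struct UniqueIndex
{
    bool conflict;   /* would insertion of the proposed value conflict? */
    bool locked;     /* is a value lock currently held? */
} UniqueIndex;

/* Phase one: acquire value locks in reverse of the usual (OID-sorted)
 * insertion order.  On any conflict, release every lock already taken,
 * since value locks must not be held while waiting on another xact. */
static bool
spec_lock_values(UniqueIndex *idx, int n)
{
    for (int i = n - 1; i >= 0; i--)
    {
        if (idx[i].conflict)
        {
            for (int j = i + 1; j < n; j++)
                idx[j].locked = false;  /* release, reverse of lock order */
            return false;
        }
        idx[i].locked = true;
    }
    return true;
}

/* Phase two: on full consensus, insert into each index in the usual
 * order, releasing the value lock on each unique index as part of its
 * insertion (the UNIQUE_CHECK_SPEC obligation). */
static bool
spec_try_insert(UniqueIndex *idx, int n)
{
    if (!spec_lock_values(idx, n))
        return false;            /* caller will lock/update or retry */
    for (int i = 0; i < n; i++)
        idx[i].locked = false;   /* insertion drops the value lock */
    return true;
}
```

Either way, no value lock outlives the attempt: they are all dropped on conflict, and dropped one by one during insertion proper.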
+
+  <para>
+<programlisting>
 bool
 aminsert (Relation indexRelation,
           Datum *values,
           bool *isnull,
           ItemPointer heap_tid,
           Relation heapRelation,
-          IndexUniqueCheck checkUnique);
+          IndexUniqueCheck checkUnique,
+          SpeculativeState *state);
 </programlisting>
    Insert a new tuple into an existing index.  The <literal>values</> and
    <literal>isnull</> arrays give the key values to be indexed, and
@@ -190,11 +250,15 @@ aminsert (Relation indexRelation,
    If the access method supports unique indexes (its
    <structname>pg_am</>.<structfield>amcanunique</> flag is true) then
    <literal>checkUnique</> indicates the type of uniqueness check to
-   perform.  This varies depending on whether the unique constraint is
-   deferrable; see <xref linkend="index-unique-checks"> for details.
-   Normally the access method only needs the <literal>heapRelation</>
-   parameter when performing uniqueness checking (since then it will have to
-   look into the heap to verify tuple liveness).
+   perform, and <literal>state</> carries speculative insertion state
+   that is passed back and forth between the AM and the core code,
+   and provides a callback through which all value locks held can be
+   released.  <literal>checkUnique</> varies depending on
+   whether the unique constraint is deferrable; see <xref
+   linkend="index-unique-checks"> for details.  Normally the access
+   method only needs the <literal>heapRelation</> parameter when
+   performing uniqueness checking (since then it will have to look
+   into the heap to verify tuple liveness).
   </para>
 
   <para>
@@ -766,7 +830,15 @@ amrestrpos (IndexScanDesc scan);
    using <firstterm>unique indexes</>, which are indexes that disallow
    multiple entries with identical keys.  An access method that supports this
    feature sets <structname>pg_am</>.<structfield>amcanunique</> true.
-   (At present, only b-tree supports it.)
+   (At present, only b-tree supports it.)  Note that
+   <structfield>amcanunique</> access methods are also obliged to
+   support <quote>speculative insertion</> via phased locking in
+   advance of heap tuple insertion, where <quote>value locks</>
+   on values proposed for insertion are held across multiple
+   <structfield>amcanunique</> indexes concurrently.  Speculative
+   insertion is used by <command>INSERT</command> statements when
+   <literal>ON CONFLICT UPDATE</literal> or <literal>ON CONFLICT
+   IGNORE</literal> is specified.
   </para>
 
   <para>
@@ -793,14 +865,38 @@ amrestrpos (IndexScanDesc scan);
        commits.  If it rolls back then there is no conflict.  If it commits
        without deleting the conflicting row again, there is a uniqueness
        violation.  (In practice we just wait for the other transaction to
-       end and then redo the visibility check in toto.)
+       end and then redo the visibility check in toto.)  However,
+       during speculative insertion a unique constraint violation
+       cannot be thrown, and it is unlikely to be practical to hold
+       on to a value lock indefinitely pending the outcome of another
+       transaction, so a callback is provided through which the AM
+       instructs the core code to release each index value lock.
+       When the wait on the conflicting transaction is complete, the
+       AM specifies that the core code should retry its slot
+       insertion from scratch, possibly reacquiring value locks on
+       previously locked values in separate, earlier-locked unique
+       indexes.
+      </para>
+      <para>
+       Note that the unlocking callback is called twice in the event
+       of locking/updating being indicated -- the first call (before
+       locking/updating the slot) releases functional value locks,
+       and the second call (after locking/updating) releases an
+       interlock against VACUUM (this should only be necessary for
+       the unique index merged on).  The interlock is necessary
+       because it would be bad news to attempt to lock/update a row
+       that happens to occupy the same heap slot as a row version
+       whose non-uniqueness the AM has already made a determination
+       about.  XXX:  There are open questions around VACUUM
+       interlocking.
       </para>
      </listitem>
      <listitem>
       <para>
        Similarly, if a conflicting valid row has been deleted by an
        as-yet-uncommitted transaction, the would-be inserter must wait
-       for that transaction to commit or abort, and then repeat the test.
+       for that transaction to commit or abort, and then repeat the test,
+       or in the case of speculative insertion, start value locking
+       for all values proposed for insertion by the current executor
+       slot from the very beginning.
       </para>
      </listitem>
     </itemizedlist>
@@ -814,6 +910,7 @@ amrestrpos (IndexScanDesc scan);
    ordinary scenario of inserting a row that's just been created by
    the current transaction.  It can happen during
    <command>CREATE UNIQUE INDEX CONCURRENTLY</>, however.)
+   Speculative inserters need not bother with this.
   </para>
 
   <para>
@@ -865,7 +962,8 @@ amrestrpos (IndexScanDesc scan);
        method must allow duplicate entries into the index, and report any
        potential duplicates by returning FALSE from <function>aminsert</>.
        For each row for which FALSE is returned, a deferred recheck will
-       be scheduled.
+       be scheduled.  Speculative insertion into deferred unique indexes is not
+       supported.
       </para>
 
       <para>
@@ -900,6 +998,33 @@ amrestrpos (IndexScanDesc scan);
        for the same tuple values as were used in the original insertion.
       </para>
      </listitem>
+     <listitem>
+      <para>
+       <literal>UNIQUE_CHECK_SPEC</> indicates that the call is the
+       second phase of speculative insertion, corresponding to an
+       earlier <function>amlock</> call.  Value locks should already
+       be held from the first phase for unique index insertion to
+       pick up from, with a heap tuple pointer now passed to finish
+       the insertion with.  The core code will have previously called
+       <function>amlock</> in respect of each index value as part of
+       this slot insertion.  Conceptually, the implementation
+       performs the index tuple insertion in a staggered manner, with
+       the work broken out into discrete locking and insertion
+       phases, and with the possibility of not continuing with
+       insertion (in order to lock a row for update, for example)
+       should total consensus to proceed among unique indexes fail to
+       emerge.
+      </para>
+
+      <para>
+       Precautions must be taken around value locking.  These are
+       separately noted in the <function>amlock</> documentation.  If
+       insertion proceeds in respect of a given slot, and
+       <function>aminsert</> is called with a
+       <literal>UNIQUE_CHECK_SPEC</> argument, the
+       <function>aminsert</> implementation is obligated to release
+       its value lock on the unique index as part of insertion.
+      </para>
+     </listitem>
     </itemizedlist>
   </para>
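The checking modes listed above suggest an enum along the following lines. The first four values match the existing IndexUniqueCheck enum in PostgreSQL's access/genam.h; the placement and value of UNIQUE_CHECK_SPEC is an assumption drawn from this patch's description, not a confirmed definition.

```c
/* Sketch of IndexUniqueCheck extended with the speculative mode;
 * UNIQUE_CHECK_SPEC's exact position here is an assumption. */
typedef enum IndexUniqueCheck
{
    UNIQUE_CHECK_NO,        /* no constraint checking */
    UNIQUE_CHECK_YES,       /* error if not unique */
    UNIQUE_CHECK_PARTIAL,   /* defer duplicates for later recheck */
    UNIQUE_CHECK_EXISTING,  /* recheck an existing, deferred tuple */
    UNIQUE_CHECK_SPEC       /* second phase of speculative insertion */
} IndexUniqueCheck;
```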
 
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 4820f76..2e6627a 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -568,6 +568,81 @@ item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
 than key.
 
+Notes about value locking for speculative insertion
+---------------------------------------------------
+
+As an amcanunique AM, the btree implementation is required to support
+"speculative insertion".  This means that the value locking method
+through which unique index enforcement conventionally occurs is
+extended and generalized, such that insertion is staggered:  the core
+code attempts to get full consensus among all unique indexes that
+values proposed for insertion will not cause duplicate key
+violations.  Speculative
+insertion is only possible for unique index insertion without deferred
+uniqueness checking (since speculative insertion into a deferred
+unique constraint's index is a contradiction in terms).
+
+For conventional unique index insertion, the btree implementation
+exclusively locks the buffer holding the first page that the value
+to be inserted could possibly be on, though only for an instant,
+during and shortly after uniqueness verification.  It would not be
+acceptable to hold this lock across complex operations for the
+duration of the remainder of the first phase of speculative
+insertion.  Therefore, we convert this exclusive buffer lock to an
+exclusive page lock managed by the lock manager, thereby greatly
+ameliorating the consequences of undiscovered deadlocking
+implementation bugs (though deadlocks are not expected), and
+minimizing the impact on system interruptibility, while not
+affecting index scans.
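The conversion described above can be modeled as a small state transition. This is a toy illustration under assumed names (PageLockState, convert_to_value_lock); the real implementation manipulates LWLocks, buffer pins, and lock-manager entries, none of which appear here.

```c
#include <stdbool.h>

/* Toy model: a speculative inserter trades its momentary exclusive
 * buffer lock for a heavyweight page lock, keeping the buffer pin so
 * that non-speculative inserters notice the heavyweight lock. */
typedef struct PageLockState
{
    bool buffer_excl_locked;  /* short-lived exclusive buffer lock */
    bool page_hwlock_held;    /* lock-manager page lock (value lock) */
    bool buffer_pinned;       /* pin held for the lock's duration */
} PageLockState;

static void
convert_to_value_lock(PageLockState *s)
{
    s->page_hwlock_held = true;     /* acquire the page value lock */
    s->buffer_excl_locked = false;  /* drop the buffer lock at once */
    s->buffer_pinned = true;        /* pin retained throughout */
}
```

Because index scans never check the heavyweight lock, readers are unaffected until an inserter actually needs to escalate.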
+
+It may be useful to informally think of the page lock type acquired by
+speculative insertion as similar to an intention exclusive lock, a
+type of lock found in some legacy 2PL database systems that use
+multi-granularity locking.  A session establishes the exclusive right
+to subsequently establish a full write lock, without actually blocking
+reads of the page unless and until a lock conversion actually occurs,
+at which point both reads and writes are blocked.  Under this mental
+model, buffer shared locks can be thought of as intention shared
+locks.
+
+As implemented, these heavyweight locks are only relevant to the
+insertion case;  at no other point are they actually considered, since
+insertion is the only way through which new values are introduced.
+The first page a value proposed for insertion into an index could be
+on represents a natural choke point for our extended, though still
+rather limited system of value locking.  Naturally, when we perform a
+"lock escalation" and acquire an exclusive buffer lock, all other
+buffer locks on the same buffer are blocked, which is how the
+implementation localizes knowledge about the heavyweight lock to
+insertion-related routines.  Apart from deletion, all exclusive
+locking of btree buffers happens as a direct or indirect result of
+insertion, so this approach is sufficient.  (Actually, an exclusive
+lock may still be acquired without insertion to initialize a root
+page, but that hardly matters.)  Note that holding a buffer pin for
+the duration of value locking is necessary to ensure that
+non-speculative inserters notice that it is necessary to acquire a
+heavyweight lock.
+
+All value locks are dropped immediately as speculative insertion is
+aborted, as the implementation waits on the outcome of another xact,
+or as "insertion proper" occurs.  These page-level locks are not
+intended to last more than an instant.  In general, the protocol for
+heavyweight locking Btree pages is that heavyweight locks are acquired
+before any buffer locks are held, while the locks are only released
+after all buffer locks are released.  While not a hard and fast rule,
+presently we avoid heavyweight page locking more than one page per
+unique index concurrently.
+
+The core code inserts into indexes in a well-defined order: sorted
+by OID.  This is a general precaution against deadlock hazards in AMs
+that acquire exclusive locks.  However, during the first phase of
+speculative insertion, corresponding locking operations first occur in
+reverse order (for unique indexes only).  This has the benefit of
+reducing lock contention, by guaranteeing that only heap tuple
+insertion occurs between value locking and index tuple insertion once
+consensus to proceed emerges.  The protocol for releasing value locks
+is that the core system always does so in reverse order to the
+original locking order (so "right-way-around"), either during
+insertion or when a would-be duplicate conflict is found.
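The ordering rules above can be made concrete with a small sketch: insertion order is ascending OID, and phase-one value locks are taken in the reverse of that order. The function name spec_orders is hypothetical; only the ordering rule itself comes from the text.

```c
#include <stdlib.h>

typedef unsigned int Oid;

static int
oid_cmp(const void *a, const void *b)
{
    Oid x = *(const Oid *) a;
    Oid y = *(const Oid *) b;
    return (x > y) - (x < y);
}

/* Fill insert_order with the OIDs sorted ascending (the usual
 * insertion order), and lock_order with the same OIDs reversed (the
 * order in which phase-one value locks are acquired). */
static void
spec_orders(const Oid *oids, int n, Oid *insert_order, Oid *lock_order)
{
    for (int i = 0; i < n; i++)
        insert_order[i] = oids[i];
    qsort(insert_order, n, sizeof(Oid), oid_cmp);
    for (int i = 0; i < n; i++)
        lock_order[i] = insert_order[n - 1 - i];
}
```

Releasing in reverse of the locking order then coincides with ascending OID order, which is why the text calls it "right-way-around".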
+
 Notes to Operator Class Implementors
 ------------------------------------
 
diff --git a/src/backend/executor/README b/src/backend/executor/README
index 8afa1e3..1aa21fd 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -200,3 +200,37 @@ is no explicit prohibition on SRFs in UPDATE, but the net effect will be
 that only the first result row of an SRF counts, because all subsequent
 rows will result in attempts to re-update an already updated target row.
 This is historical behavior and seems not worth changing.)
+
+Speculative insertion
+---------------------
+
+Speculative insertion is a process that the executor manages for the benefit of
+INSERT ... ON CONFLICT UPDATE.  The basic idea is that values proposed for
+insertion (which do not currently exist in the index) are speculatively locked
+within AMs.  If a consensus to insert emerges among all unique indexes, we
+proceed with physical index tuple insertion for each unique index in turn,
+releasing value locks as each physical insertion is performed.  Otherwise, we
+must UPDATE the existing row (or IGNORE the conflict).
+
+When we UPDATE, value locks are released before an opportunistic attempt at
+locking a conclusively visible conflicting tuple occurs.  If this process
+fails, we retry.  We may retry indefinitely.  Holding on to value locks here
+would serve no practical purpose, since they do not prevent many types of
+conflicts that the UPDATE case must care about, and would be actively harmful,
+since it would result in unprincipled deadlocking under high concurrency.
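The retry behavior described above can be sketched as a toy state machine. All names here (SpecSim, exec_on_conflict_update, and the helper functions) are illustrative stand-ins for internal executor steps, not real routines.

```c
#include <stdbool.h>

typedef enum SpecOutcome { SPEC_INSERTED, SPEC_UPDATED } SpecOutcome;

/* Simulated environment: whether a visible conflicting row exists, and
 * how many times locking it fails (concurrent delete/update) before
 * succeeding.  iterations counts restarts, for illustration. */
typedef struct SpecSim
{
    bool has_conflict;
    int  row_lock_failures;
    int  iterations;
} SpecSim;

static bool
value_locks_ok(const SpecSim *s)
{
    return !s->has_conflict;
}

static bool
lock_conflicting_tuple(SpecSim *s)
{
    if (s->row_lock_failures > 0)
    {
        s->row_lock_failures--;  /* tuple concurrently deleted/updated */
        return false;
    }
    return true;
}

static SpecOutcome
exec_on_conflict_update(SpecSim *s)
{
    for (;;)
    {
        s->iterations++;
        if (value_locks_ok(s))
            return SPEC_INSERTED;   /* consensus: ordinary insertion */
        /* Conflict found: value locks are released before we attempt
         * to lock the conclusively visible conflicting tuple. */
        if (lock_conflicting_tuple(s))
            return SPEC_UPDATED;    /* apply EvalPlanQual, then UPDATE */
        /* Row locking failed: restart from scratch; the loop may
         * retry indefinitely. */
    }
}
```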
+
+The representation of the UPDATE query tree is as a separate query tree,
+auxiliary to the main INSERT query tree, and its plan is not formally a subplan
+of the parent INSERT's.  Rather, the plan's state is used selectively by its
+parent.
+
+Having successfully locked a definitively visible tuple, we update it, applying
+the EvalPlanQual() query execution mechanism to the latest (as just determined
+by an amcanunique AM) conclusively visible, now-locked tuple.  Earlier versions
+are not evaluated against our qual, and we never directly walk the update chain
+in the event of the tuple being deleted/updated (which is conceptually a
+conflict); the process simply restarts without making useful progress in the
+present iteration.  It is sometimes necessary to UPDATE a row where no row
+version is visible, so it would seem inconsistent to require that earlier
+versions (including a version that may exist that is visible to our command's
+MVCC snapshot) satisfy the qual merely because a version happened to be
+visible, where otherwise no evaluation would occur.
-- 
1.9.1

